Data-adaptive trimming of the Hill estimator and detection of outliers in the extremes of heavy-tailed data

We introduce a trimmed version of the Hill estimator for the index of a heavy-tailed distribution, which is robust to perturbations in the extreme order statistics. In the ideal Pareto setting, the estimator is essentially finite-sample efficient among all unbiased estimators with a given strict upper break-down point. For general heavy-tailed models, we establish the asymptotic normality of the estimator under second order regular variation conditions and also show it is minimax rate-optimal in the Hall class of distributions. We also develop an automatic, data-driven method for the choice of the trimming parameter which yields a new type of robust estimator that can adapt to the unknown level of contamination in the extremes. This adaptive robustness property makes our estimator particularly appealing and superior to other robust estimators in the setting where the extremes of the data are contaminated. As an important application of the data-driven selection of the trimming parameters, we obtain a methodology for the principled identification of extreme outliers in heavy tailed data. Indeed, the method has been shown to correctly identify the number of outliers in the previously explored Condroz data set.


Introduction
The estimation of the tail index for heavy-tailed distributions is perhaps one of the most studied problems in extreme value theory. Since the seminal works of [27,32,24] among many others, numerous aspects of this problem and its applications have been explored (see e.g., the monographs [21] and [7]).
Let $X_1, \dots, X_n$ be an i.i.d. sample from a distribution $F$. We shall say that $F$ has a heavy (right) tail if
$$P(X > x) \equiv 1 - F(x) \sim \ell(x)\, x^{-1/\xi}, \quad \text{as } x \to \infty, \tag{1.1}$$
for some $\xi > 0$ and a slowly varying function $\ell : (0, \infty) \to (0, \infty)$, i.e., $\ell(\lambda x)/\ell(x) \to 1$, $x \to \infty$, for all $\lambda > 0$. The parameter $\xi$ is referred to as the tail index of $F$. Its estimation is of fundamental importance to the applications of extreme value theory (see for example the monographs [7], [17], [33], and the references therein). The fact that the tail index $\xi$ governs the asymptotic right tail-behavior of $F$ means that, in practice, one should estimate it by focusing on the most extreme values of the sample. In many applications, one may quickly run out of data since only the largest few order statistics are utilized. Since every extreme data-point matters, the problem becomes even more challenging when a certain number of these large order statistics are corrupted. Contamination of the top order statistics, if not properly accounted for, can lead to severe bias in the estimation of the tail index. For example, the right panel of Figure 1 shows the classic Hill plot, its biased version and our new trimmed Hill plot for a data set which has been previously identified to have 6 outliers (see [34,36] and Section 5, below, for more details). We shall elaborate more on the construction of these three plots (available at https://shrijita-apps.shinyapps.io/adaptive-trimmed-hill/) in the rest of the introduction, but observe the drastic difference in the tail-index estimates produced by these methods.
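Recall the classic Hill estimator of $\xi$:
$$\hat{\xi}_k(n) := \frac{1}{k}\sum_{i=1}^{k} \log\frac{X_{(n-i+1,n)}}{X_{(n-k,n)}}, \quad 1 \le k \le n-1. \tag{1.2}$$
It is based on the top-$k$ order statistics, where $X_{(n,n)} \ge X_{(n-1,n)} \ge \dots \ge X_{(1,n)}$ denote the order statistics of the sample $X_i$, $i = 1, \dots, n$.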
Naturally, one can trim a certain number of the largest order statistics in order to obtain a robust estimator of $\xi$. This idea has already been considered in Brazauskas and Serfling [13], who (among other robust estimators) defined a trimmed version of the Hill estimator:
$$\hat{\xi}^{\mathrm{trim}}_{k_0,k}(n) := \sum_{i=k_0+1}^{k} c_{k_0,k}(i)\, \log\frac{X_{(n-i+1,n)}}{X_{(n-k,n)}}, \quad 0 \le k_0 < k < n, \tag{1.3}$$
where the weights $c_{k_0,k}(i)$ were chosen so that the estimator is asymptotically unbiased for $\xi$ (see Section 3.1 in [13]). The weights used by Brazauskas and Serfling, however, are not optimal. In Section 2.1, we show that the asymptotically optimal trimmed Hill estimator has the form
$$\hat{\xi}_{k_0,k}(n) := \frac{k_0}{k-k_0}\log\frac{X_{(n-k_0,n)}}{X_{(n-k,n)}} + \underbrace{\frac{1}{k-k_0}\sum_{i=k_0+1}^{k}\log\frac{X_{(n-i+1,n)}}{X_{(n-k,n)}}}_{=:\ \hat{\xi}^{0}_{k_0,k}(n)}, \quad 0 \le k_0 < k < n. \tag{1.4}$$
Note that if $k_0 = 0$, the trimmed Hill estimator $\hat{\xi}_{k_0,k}(n)$ coincides with the classic Hill estimator. A number of authors have also considered trimming, but of the models rather than the data. Specifically, the seminal works of [2] and [6] studied the case where the distribution is truncated to a potentially unknown large value. In contrast, here we assume a non-truncated heavy-tailed model and trim the data as a way of achieving robustness to outliers in the extremes.
Suppose now that somehow one has identified that the top-$k_0$ order statistics have been corrupted. Following [29], if one were to simply ignore them and apply the classic Hill estimator to the observations $X_{(n-k_0,n)} \ge \dots \ge X_{(n-k,n)}$, the estimator would be biased. Indeed, the second summand, $\hat{\xi}^{0}_{k_0,k}(n)$, in (1.4) gives the expression for this biased Hill estimator. The recent work of Zou et al. [38] uses this biased Hill estimator in a different inferential censoring-type context, where an unknown number $k_0$ of the top order statistics is missing.
Let us return to Figure 1 (right panel). It shows the classic Hill plot, i.e., the plot of $\hat{\xi}_k(n)$ as a function of $k$, as well as the plots of $\hat{\xi}_{k_0,k}(n)$ and $\hat{\xi}^{0}_{k_0,k}(n)$ as functions of $k$. We refer to the last two plots as the trimmed Hill and biased Hill plots, respectively. Since the data exhibit six outliers, the trimmed Hill and biased Hill plots are based on $k_0 = 6$. The significant difference among the three plots demonstrates the effect that outliers can have on the estimation of the tail index.
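The three estimators behind these plots can be sketched in a few lines. The code below is a minimal illustration, assuming the trimmed Hill estimator is the biased Hill estimator plus the correction term $\frac{k_0}{k-k_0}\log\big(X_{(n-k_0,n)}/X_{(n-k,n)}\big)$; the function names `hill`, `biased_hill` and `trimmed_hill` are hypothetical and are not taken from the authors' R implementation.

```python
import math
import random


def biased_hill(x, k0, k):
    """Classic Hill formula applied after discarding the top-k0 values."""
    if not 0 <= k0 < k < len(x):
        raise ValueError("need 0 <= k0 < k < n")
    s = sorted(x, reverse=True)      # s[i - 1] is X_(n-i+1,n); s[k] is X_(n-k,n)
    base = math.log(s[k])
    return sum(math.log(s[i - 1]) - base for i in range(k0 + 1, k + 1)) / (k - k0)


def trimmed_hill(x, k0, k):
    """Biased Hill estimator plus the correction term (assumed form)."""
    s = sorted(x, reverse=True)
    correction = k0 / (k - k0) * (math.log(s[k0]) - math.log(s[k]))  # s[k0] is X_(n-k0,n)
    return correction + biased_hill(x, k0, k)


def hill(x, k):
    """Classic Hill estimator: the trimmed estimator with k0 = 0."""
    return trimmed_hill(x, 0, k)


# Usage: Pareto(1, xi) data generated as U**(-xi), with the top 6 values inflated.
rng = random.Random(0)
xi = 2.0
sample = sorted(((1.0 - rng.random()) ** (-xi) for _ in range(500)), reverse=True)
sample[:6] = [v ** 2 for v in sample[:6]]   # illustrative contamination only
print(hill(sample, 100), trimmed_hill(sample, 6, 100), biased_hill(sample, 6, 100))
```

Sorting inside each call keeps the sketch short; for a full Hill plot over many values of $k$ one would sort once and reuse cumulative sums of the log order statistics.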
In this paper, we introduce and study the trimmed Hill estimator $\hat{\xi}_{k_0,k}(n)$ defined in (1.4). We begin by establishing its finite sample optimality and robustness properties. Specifically, for ideal Pareto data, we establish in Theorem 2.5 that the trimmed Hill estimator is nearly minimum-variance among all unbiased estimators with a given strict upper breakdown point (see Definition 2.4). Since the Pareto regime emerges asymptotically, it is not surprising that the trimmed Hill estimator is also minimax rate-optimal. This is shown in Theorem 3.3 for the Hall class of heavy-tailed distributions. Furthermore, under technical second-order regular variation conditions, we establish the asymptotic normality of the trimmed Hill estimator in Section 3.2.
The optimality and asymptotic properties of the trimmed Hill estimator, although interesting, are not practically useful unless one has a data-adaptive method for the choice of the trimming parameter $k_0$. This problem is addressed in Section 2.2. We start by introducing a diagnostic plot to visually determine the number of outliers $k_0$. It is a plot of the trimmed Hill estimator as a function of $k_0$ for a fixed $k$. Figure 1 (middle panel) displays this plot for a real data set. A sudden change point at $k_0 = 6$ further corroborates the hypothesis of six plausible outliers in the data set. This value of $k_0$ was automatically identified by the method we introduce in Section 2.2. The methodology for the automatic selection of $k_0$ is based on a weighted sequential testing method, which exploits the elegant structure of the joint distribution of $\hat{\xi}_{k_0,k}(n)$, $k_0 = 0, 1, \dots, k-1$, in the ideal Pareto setting. In Section 3.3, we show that this test is asymptotically consistent in the general heavy-tailed regime (1.1) under second order conditions on the regularly varying function of [4]. In fact, the resulting estimator $\hat{\xi}_{\hat{k}_0,k}(n)$, where $\hat{k}_0$ is automatically selected, has excellent finite sample performance and is adaptively robust. This novel adaptive robustness property is not present in the other robust estimators of [20,23,28,13,31,14], which involve hard-to-select tuning parameters. Moreover, none of these estimators is able to identify outliers in the extremes, a property inherent to the adaptive trimmed Hill estimator. An R shiny app implementing the trimmed Hill estimator and the methodology for the selection of $k_0$ is available at https://shrijita-apps.shinyapps.io/adaptive-trimmed-hill/.
The paper is structured as follows. In Section 2, we study the benchmark Pareto setting. We establish finite-sample optimality and robustness properties of the trimmed Hill estimator. We also introduce a sequential testing method for the automatic selection of k 0 . Section 3 deals with the asymptotic properties of the trimmed Hill estimator in the general heavy-tailed regime. The consistency of the sequential testing method is also studied. In Section 4, the finite-sample performance of the trimmed Hill estimator is studied in the context of various heavy tailed models, tail indices, and contamination scenarios. In Sections 4.3, 4.4 and 4.5, we demonstrate the need for adaptive robustness and the advantages of our estimator in comparison with established robust estimators in the literature. In Section 5, we demonstrate the application of the adaptive trimmed Hill methodology to the Condroz data set and French insurance claim settlements data set.

Optimal and Adaptive Trimming: The Pareto Regime
In this section, we shall focus on the fundamental Pareto($\sigma$, $\xi$) model and assume that
$$P(X > x) \equiv 1 - F(x) = (x/\sigma)^{-1/\xi}, \quad x \ge \sigma, \tag{2.1}$$
for some $\sigma > 0$ and a tail index $\xi > 0$. Motivated by the goal of providing a robust estimate of the tail index $\xi$, we consider trimmed versions of the classical Hill estimator in Relation (1.2) and thereby study the class of statistics $\hat{\xi}^{\mathrm{trim}}_{k_0,k}(n)$ as in Relation (1.3). Proposition 2.1 below finds the optimal weights $c_{k_0,k}(i)$ for which the estimator in Relation (1.3) is not only unbiased for $\xi$, but also has the minimum variance. This yields the trimmed Hill estimator of Relation (1.4). Its performance for general heavy-tailed models is discussed in Section 3.

The Trimmed Hill estimator
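The following result gives the form of the trimmed Hill estimator, which is indeed the best linear unbiased estimator (BLUE) among the class of estimators in Relation (1.3).

Proposition 2.1. Suppose $X_1, \dots, X_n$ are i.i.d. Pareto($\sigma$, $\xi$) random variables, as in Relation (2.1). Then, among the general class of estimators given by Relation (1.3), the minimum variance linear unbiased estimator of $\xi$ is given by
$$\hat{\xi}_{k_0,k}(n) = \frac{k_0}{k-k_0}\log\frac{X_{(n-k_0,n)}}{X_{(n-k,n)}} + \underbrace{\frac{1}{k-k_0}\sum_{i=k_0+1}^{k}\log\frac{X_{(n-i+1,n)}}{X_{(n-k,n)}}}_{=:\ \hat{\xi}^{0}_{k_0,k}(n)}, \quad 0 \le k_0 < k < n. \tag{2.2}$$
The proof is given in Section 6.2.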
Remark 2.2. The second summand, $\hat{\xi}^{0}_{k_0,k}(n)$, in Relation (2.2) is nothing but the classic Hill estimator applied to the observations $X_{(n-k_0,n)} \ge \dots \ge X_{(n-k,n)}$, i.e., the top $k$ order statistics excluding the top $k_0$ ones. Note that $\hat{\xi}^{0}_{k_0,k}(n)$, which belongs to the class of estimators in Relation (1.3), is not only suboptimal but also biased for the tail index $\xi$. We shall thus refer to it as the biased Hill estimator. The biased Hill estimator has been previously used for robust analysis (see [36]) and inference in truncated Pareto models (see [29], [38]).

Remark 2.3 (Classic, Biased and Trimmed Hill Plots). The classic Hill plot is a plot of the classic Hill estimator $\hat{\xi}_k(n)$ as a function of $k$. Likewise, for a fixed $k_0$, the plots of the trimmed Hill estimator $\hat{\xi}_{k_0,k}(n)$ and the biased Hill estimator $\hat{\xi}^{0}_{k_0,k}(n)$ as functions of $k$ will be referred to as the trimmed Hill plot and the biased Hill plot, respectively. Since $\hat{\xi}^{0}_{k_0,k}(n) \le \hat{\xi}_{k_0,k}(n)$, the biased Hill plot always lies below the trimmed Hill plot. Depending upon the nature of outliers in the extremes, the classic Hill plot can lie either above or below the trimmed Hill plot (see Figures 1 and 11).
In the rest of the section, we discuss the robustness and finite-sample optimality properties of the trimmed Hill estimator. In this direction, inspired by [13], we define the notion of strict upper breakdown point.
In Proposition 2.1, we showed that the trimmed Hill estimator is the BLUE for a large class of estimators with strict upper breakdown point $k_0/n$ (see Relation (1.3)). We next prove a stronger result on the finite sample near-optimality of the trimmed Hill estimator. As stated in the next theorem, the trimmed Hill estimator is essentially the minimum variance unbiased estimator (MVUE) among the class of all tail index estimators with a given strict upper breakdown point.
Theorem 2.5. Consider the class of statistics given by which are all unbiased estimators of ξ with strict upper breakdown point β = k 0 /n. Then for ξ k 0 ,n−1 (n) as in Relation (2.2), we have In particular, ξ k 0 ,n−1 (n) is asymptotically MVUE of ξ among the class of estimators described by U k 0 .
The proof is given in Section 6.3.
Though the trimmed Hill estimator has nice finite sample properties, it is of limited use in practice unless the value of trimming parameter k 0 is known. In the following section, we will develop a data-driven method for the estimation of k 0 .

Automated Selection of the Trimming Parameter
In this section, we introduce a methodology for the automated data-driven selection of the trimming parameter $k_0$. The trimmed Hill estimator with this estimated value of $k_0$ will be referred to as the adaptive trimmed Hill estimator. Its performance as a robust estimator of the tail index $\xi$ is discussed in detail in Section 4. In addition, the $k_0$-estimation methodology also provides a tool for the detection of outliers in the extremes of heavy-tailed data.
We begin with a result on the joint distribution of the trimmed Hill statistics, which is a starting point towards the estimation of k 0 .
Proposition 2.6. The joint distribution of ξ k 0 ,k (n) can be expressed as follows: The proof is given in Section 6.2. This result motivates a simple visual device for the selection of k 0 .
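A short sketch of why a simple joint representation is plausible in the Pareto regime. It assumes the trimmed Hill estimator has the two-term form $\frac{k_0}{k-k_0}\log\frac{X_{(n-k_0,n)}}{X_{(n-k,n)}} + \frac{1}{k-k_0}\sum_{i=k_0+1}^{k}\log\frac{X_{(n-i+1,n)}}{X_{(n-k,n)}}$ and uses the standard Rényi representation of exponential spacings. The symbols $E_j$ and $\Gamma_m$ are notation introduced here for illustration only.

```latex
% Renyi representation for Pareto(\sigma,\xi) data, with E_1,\dots,E_k i.i.d. standard exponential
% (the identity holds jointly in i):
\log X_{(n-i+1,n)} - \log X_{(n-k,n)} \stackrel{d}{=} \xi \sum_{j=i}^{k} \frac{E_j}{j},
\qquad 1 \le i \le k .

% Substituting into the two-term form and exchanging the order of summation:
(k-k_0)\,\hat{\xi}_{k_0,k}(n)
  \stackrel{d}{=} \xi \Big[ k_0 \sum_{j=k_0+1}^{k} \frac{E_j}{j}
      + \sum_{i=k_0+1}^{k} \sum_{j=i}^{k} \frac{E_j}{j} \Big]
  = \xi \sum_{j=k_0+1}^{k} \frac{k_0 + (j-k_0)}{j}\, E_j
  = \xi \sum_{j=k_0+1}^{k} E_j .

% Hence, with \Gamma_m denoting a sum of m i.i.d. standard exponentials,
\hat{\xi}_{k_0,k}(n) \stackrel{d}{=} \xi\, \frac{\Gamma_{k-k_0}}{k-k_0},
\qquad
E\,\hat{\xi}_{k_0,k}(n) = \xi,
\qquad
\mathrm{Var}\,\hat{\xi}_{k_0,k}(n) = \frac{\xi^2}{k-k_0}.

% Without the correction term, the same computation gives a downward bias for k_0 \ge 1:
E\,\hat{\xi}^{0}_{k_0,k}(n) = \frac{\xi}{k-k_0} \sum_{j=k_0+1}^{k} \frac{j-k_0}{j} < \xi .
```

Under these assumptions the estimators for different $k_0$ are built from partial sums of the same i.i.d. exponentials, which is what makes ratios of consecutive trimmed Hill statistics tractable, and the variance $\xi^2/(k-k_0)$ matches a plug-in standard error of the form $\hat{\xi}_{k_0,k}(n)/\sqrt{k-k_0}$.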
Diagnostic Plot. For a fixed value of $k$, the plot of $\hat{\xi}_{k_0,k}(n)$ as a function of $k_0$ will be referred to as a trimmed Hill diagnostic plot. Figure 2 shows diagnostic plots for simulated data in the cases of no outliers (left panel) and $k_0 = 5$ outliers (right panel). The vertical lines correspond to $\hat{\xi}_{k_0,k}(n) \pm \hat{\sigma}_{k_0,k}(n)$, where $\hat{\sigma}_{k_0,k}(n) = \hat{\xi}_{k_0,k}(n)/\sqrt{k - k_0}$ is the plug-in estimate of the standard error of $\hat{\xi}_{k_0,k}(n)$ (see Proposition 2.6). In the absence of outliers, modulo variability, the diagnostic plot should be constant in $k_0$ (see left panel in Figure 2). The right panel in Figure 2 corresponds to a case where extreme outliers have been introduced by raising the top $k_0 = 5$ order statistics to a power greater than 1. This resulted in a visible kink in the diagnostic plot near $k_0 = 5$. Note that, in principle, the presence of outliers could lead to a kink or change point with an upward or downward trend in the left part of the plot. The diagnostic plot, while useful, requires visual inspection of the data. In practice, an automated procedure is often desirable.
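The numbers behind such a diagnostic plot can be computed as in the sketch below. It assumes the two-term form of the trimmed Hill estimator and the plug-in standard error $\hat{\xi}_{k_0,k}(n)/\sqrt{k-k_0}$; the names `trimmed_hill` and `diagnostic_curve` are hypothetical, and drawing the curve is left to any plotting tool.

```python
import math


def trimmed_hill(x, k0, k):
    """Trimmed Hill estimator (assumed two-term form) for 0 <= k0 < k < n."""
    s = sorted(x, reverse=True)
    logs = [math.log(v) for v in s[: k + 1]]        # logs[i] is log X_(n-i,n)
    total = k0 * (logs[k0] - logs[k])               # correction term
    total += sum(logs[i] - logs[k] for i in range(k0, k))
    return total / (k - k0)


def diagnostic_curve(x, k):
    """Return (k0, estimate, plug-in standard error) for k0 = 0, ..., k-1."""
    curve = []
    for k0 in range(k):
        est = trimmed_hill(x, k0, k)
        curve.append((k0, est, est / math.sqrt(k - k0)))
    return curve


# Usage: a tiny sample with log order statistics 3, 2, 1, 0.
for k0, est, se in diagnostic_curve([math.exp(3), math.exp(2), math.exp(1), 1.0], 3):
    print(k0, round(est, 3), round(se, 3))
```

A kink in the sequence of estimates, read from right to left, is the visual signal that the automated procedure tries to detect.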
The crux of our methodology for automated selection of $k_0$ lies in the next result. The idea is to automatically detect a change point in the diagnostic plot by examining it sequentially from right to left. Formally, this will be achieved by a sequential testing algorithm involving the ratio statistics introduced next.

Proposition 2.7. Suppose all the $X_i$'s are generated from Pareto($\sigma$, $\xi$). Then, the statistics
$$T_{k_0,k}(n) := \frac{(k-k_0-1)\,\hat{\xi}_{k_0+1,k}(n)}{(k-k_0)\,\hat{\xi}_{k_0,k}(n)}, \quad k_0 = 0, 1, \dots, k-2, \tag{2.6}$$
are independent and follow the Beta($k - k_0 - 1$, 1) distribution for $k_0 = 0, 1, \dots, k-2$.
Remark 2.8. Note that $T_{k_0,k}(n)$ depends only on $X_{(n-k_0,n)}, \dots, X_{(n-k,n)}$. Therefore, the joint distribution of the $T_{k_0,k}(n)$'s remains the same as long as
$$(X_{(n-k_0,n)}, \dots, X_{(n-k,n)}) \stackrel{d}{=} (Y_{(n-k_0,n)}, \dots, Y_{(n-k,n)}),$$
where $Y_{(n,n)} > \dots > Y_{(1,n)}$ are the order statistics of $n$ i.i.d. observations from Pareto($\sigma$, $\xi$). In other words, Proposition 2.7 holds even in the presence of outliers, provided that they are confined to the top-$k_0$ order statistics. This motivates the sequential testing methodology discussed next.
Weighted Sequential Testing. By Proposition 2.7, in the Pareto regime, the statistics (2.10) are i.i.d. U (0, 1). This follows from the simple observation that T k−k 0 −1 k 0 ,k (n) ∼ U (0, 1). For simplicity, both in terms of notation and computation, we use the transformation in Relation (2.10) to switch from beta to uniformly distributed random variables.
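The ratio statistics and their uniform versions can be sketched as follows. Two assumptions are made and should be read as such: the ratio is taken to be $T_{k_0,k}(n) = (k-k_0-1)\hat{\xi}_{k_0+1,k}(n) / ((k-k_0)\hat{\xi}_{k_0,k}(n))$, and the uniform transform is taken to be the two-sided map $2\,|T^{\,k-k_0-1} - 1/2|$, which sends a U(0,1) variable to a U(0,1) variable and treats unusually small and unusually large ratios alike. The function names are hypothetical.

```python
import math


def trimmed_hill(x, k0, k):
    """Trimmed Hill estimator (assumed two-term form) for 0 <= k0 < k < n."""
    s = sorted(x, reverse=True)
    logs = [math.log(v) for v in s[: k + 1]]        # logs[i] is log X_(n-i,n)
    total = k0 * (logs[k0] - logs[k])
    total += sum(logs[i] - logs[k] for i in range(k0, k))
    return total / (k - k0)


def ratio_statistics(x, k):
    """T_{k0,k}(n) for k0 = 0, ..., k-2 (assumed definition, see lead-in)."""
    est = [trimmed_hill(x, k0, k) for k0 in range(k)]
    return [(k - k0 - 1) * est[k0 + 1] / ((k - k0) * est[k0]) for k0 in range(k - 1)]


def to_uniform(t, k):
    """Map T_{k0,k}(n) to U_{k0,k}(n) via the assumed two-sided transform."""
    return [2.0 * abs(tj ** (k - j - 1) - 0.5) for j, tj in enumerate(t)]


# Usage on a tiny sample with log order statistics 3, 2, 1, 0.
x = [math.exp(3), math.exp(2), math.exp(1), 1.0]
t = ratio_statistics(x, 3)
print(t, to_uniform(t, 3))
```

For sorted data each ratio lies in $[0, 1]$, so the power transform is always well defined.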
Assuming that outliers affect only the top-k 0 order statistics, one can identify k 0 as the largest value j for which U j,k (n) fails a test for uniformity. Specifically, we consider a sequential testing procedure, where starting with j = k − 2, we test the null hypothesis H 0 (j) : U j,k (n) ∼ U (0, 1) at level α j . If we fail to reject H 0 (j), we set j = j − 1 and repeat the process until we either encounter a rejection or j = 0. The resulting value of j is our estimate k 0 . The methodology is formally described in the following algorithm.
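The scan described above can be sketched as follows. The sketch makes two assumptions that the text does not spell out: $H_0(j)$ is rejected when $U_{j,k}(n) > 1 - \alpha_j$, and a rejection at index $j$ is reported as $\hat{k}_0 = j + 1$ (so that a rejection at $j = 0$ yields $\hat{k}_0 > 0$, in line with a type I error of the form $P(\hat{k}_0 > 0) = q$). The function name `select_k0` is hypothetical.

```python
def select_k0(u, alpha):
    """Scan j = k-2, ..., 0 and stop at the first rejection.

    u[j] and alpha[j] hold U_{j,k}(n) and the level alpha_j for j = 0, ..., k-2.
    Returns j + 1 at the first rejection (assumed convention), or 0 if
    no hypothesis is rejected.
    """
    if len(u) != len(alpha):
        raise ValueError("u and alpha must have the same length")
    for j in range(len(u) - 1, -1, -1):
        if u[j] > 1.0 - alpha[j]:     # assumed rejection rule at level alpha_j
            return j + 1
    return 0


# Usage: only the statistic at j = 1 is extreme, so the estimate is 2.
print(select_k0([0.20, 0.99, 0.30], [0.05, 0.05, 0.05]))
```

Because the scan runs from right to left and stops at the first rejection, the returned value corresponds to the largest index at which uniformity is rejected.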
Since $\alpha_j$ varies as a function of $j$, we refer to Algorithm 1 as the weighted sequential testing algorithm. The family-wise error rate of the algorithm is well calibrated at level $q \in (0, 1)$, provided
$$\prod_{j=0}^{k-2} (1 - \alpha_j) = 1 - q. \tag{2.11}$$

Proposition 2.9. For i.i.d. observations from Pareto($\sigma$, $\xi$), let $\hat{k}_0$ be the value from Algorithm 1 with $\alpha_j$ as in Relation (2.11). Then, under the null hypothesis $H_0 : k_0 = 0$, we have $P_{H_0}(\hat{k}_0 > 0) = q$.
Remark 2.10 (Choice of $\alpha_j$). For the purposes of this paper, the levels $\alpha_j$ in the above algorithm are chosen as follows:
$$\alpha_j = 1 - (1 - q)^{c\, a^{k-j-1}}, \quad j = 0, 1, \dots, k-2, \tag{2.13}$$
with $a > 1$ and $c = 1/\sum_{j=0}^{k-2} a^{k-j-1}$. This choice of $\alpha_j$ satisfies Relation (2.11), which in view of Proposition 2.9, ensures that the algorithm is well calibrated. In addition, this choice puts less weight on large values of $j$ and thereby allows for a smaller type I error, and hence fewer rejections, for the hypotheses $H_0(j) : U_{j,k}(n) \sim U(0, 1)$ with large $j$. This implies that large values of $j$ are less likely to be chosen over smaller ones. This guards against encountering spurious values of $\hat{k}_0$ close to $k$, which can lead to highly variable estimates $\hat{\xi}_{\hat{k}_0,k}(n)$. Our extensive analysis with a variety of sequential tests indicates that the choice of levels as in Relation (2.13) with $a = 1.2$ works well in practice.
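A quick numerical check of such a choice of levels is sketched below. It assumes levels of the form $\alpha_j = 1 - (1-q)^{c\,a^{k-j-1}}$ with $c = 1/\sum_{j=0}^{k-2} a^{k-j-1}$, for which $\prod_j (1-\alpha_j) = 1 - q$ holds by construction; the function name `levels` and its default arguments are hypothetical.

```python
import math


def levels(k, q=0.05, a=1.2):
    """Levels alpha_j for j = 0, ..., k-2 under the assumed geometric weighting."""
    if k < 2 or not 0.0 < q < 1.0 or a <= 1.0:
        raise ValueError("need k >= 2, 0 < q < 1 and a > 1")
    # c * a**(k-j-1) equals a**(-j) / sum_i a**(-i); this form avoids overflow for large k.
    weights = [a ** (-j) for j in range(k - 1)]
    total = sum(weights)
    return [1.0 - (1.0 - q) ** (w / total) for w in weights]


# Usage: the levels decrease in j and their complements multiply to 1 - q.
alpha = levels(10)
print([round(v, 4) for v in alpha])
print(math.prod(1.0 - v for v in alpha))
```

The decreasing pattern in $j$ is what makes rejections near $j = k - 2$ rarer than rejections near $j = 0$.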
Remark 2.11. Proposition 2.9 shows that in the Pareto case the weighted sequential testing algorithm is well calibrated and attains the exact level type I error. In the general heavy tailed regime, Theorem 3.10 (below) establishes the asymptotic consistency of the algorithm. In Section 4, we show that the algorithm can identify the true k 0 in the ideal Pareto regime as well as the challenging cases of Burr and T distributions (see Section 4.5).

The General Heavy Tailed Regime
In this section, we study the asymptotic properties of the trimmed Hill statistics for a general class of heavy-tailed distributions $F$ as in Relation (1.1). Consider the tail quantile function corresponding to $F$, defined as follows:
$$Q(t) := \inf\{x : F(x) \ge 1 - 1/t\}, \quad t > 1. \tag{3.1}$$
Following [4], for $F$ as in Relation (1.1), one can equivalently assume that
$$Q(t) = t^{\xi} L(t), \quad t > 1, \tag{3.2}$$
for some slowly varying function $L$.

Remark 3.1. The relation between the slowly varying functions $\ell$ and $L$ in Relations (1.1) and (3.2) is well known (see e.g., [4], [33] and [12]). Specifically, one can show that ). This, in view of Theorem 1.5.13 of [11], implies that $1/\tilde{\ell}$ is the de Bruijn conjugate of $\tilde{L}$ and hence unique up to asymptotic equivalence.
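We start with a conceptually important derivation used in the rest of the section. Using the tail quantile function, one can express the trimmed Hill statistic under the general heavy-tailed model (1.1) as the sum of a trimmed Hill statistic based on ideal Pareto data plus a remainder term. More precisely, in view of (3.1) and (3.2), any i.i.d. sample $X_i$, $i = 1, \dots, n$, from $F$ can be represented as
$$X_i \stackrel{d}{=} Q(Y_i) = Y_i^{\xi} L(Y_i), \quad i = 1, \dots, n, \tag{3.3}$$
where the $Y_i$'s are i.i.d. Pareto(1, 1). Consequently,
$$\hat{\xi}_{k_0,k}(n) \stackrel{d}{=} \hat{\xi}^{*}_{k_0,k}(n) + R_{k_0,k}(n), \tag{3.4}$$
with
$$\hat{\xi}^{*}_{k_0,k}(n) := \frac{k_0}{k-k_0}\log\frac{Y^{\xi}_{(n-k_0,n)}}{Y^{\xi}_{(n-k,n)}} + \frac{1}{k-k_0}\sum_{i=k_0+1}^{k}\log\frac{Y^{\xi}_{(n-i+1,n)}}{Y^{\xi}_{(n-k,n)}},$$
$$R_{k_0,k}(n) := \frac{k_0}{k-k_0}\log\frac{L(Y_{(n-k_0,n)})}{L(Y_{(n-k,n)})} + \frac{1}{k-k_0}\sum_{i=k_0+1}^{k}\log\frac{L(Y_{(n-i+1,n)})}{L(Y_{(n-k,n)})},$$
where the $Y_{(i,n)}$'s are the order statistics of the $Y_i$'s. Since the $Y_i^{\xi}$'s follow Pareto(1, $\xi$), the statistic $\hat{\xi}^{*}_{k_0,k}(n)$ in (3.4) is simply the trimmed Hill estimator for ideal Pareto data and $R_{k_0,k}(n)$ is a remainder term that encodes the effect of the slowly varying function $L$.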
The nature of the function L determines the rate at which the remainder term R k 0 ,k (n) converges to 0 in probability. We establish minimax rate optimality of the trimmed Hill estimator under the Hall class of assumptions on the function L (see Section 3.1). To establish the asymptotic normality of the trimmed Hill estimator, we use second order regular variation conditions on the function L (see Section 3.2). Under the same set of conditions, the asymptotic consistency of the weighted sequential testing algorithm is also established in Section 3.3.

Minimax Rate optimality of the Trimmed Hill Estimator
Here, we study the rate-optimality of the trimmed Hill estimator for the class of distributions in D := D ξ (B, ρ), where Relation (3.2) holds with tail index ξ > 0 and L of the form: for constants B > 0 and ρ > 0 (see also Relation (2.7) in [12]). This is known as the Hall class of distributions.
In [25], Hall and Welsh showed that no estimator can be uniformly consistent over the class of distributions in D at a rate faster than or equal to n ρ/(2ρ+1) . Theorem 1 of [25] adapted to our setting and notation is as follows: Theorem 3.2 (optimal rate). Let ξ n be any estimator of ξ based on an independent sample from a distribution F ∈ D ξ (B, ρ). If we have then lim inf n→∞ n ρ/(2ρ+1) a(n) = ∞. Here by P F , we understand that ξ n was based on independent realizations from F .
In Theorem 3 of [25], it is shown that for the case of no outliers, the classic Hill estimator, ξ k with k = k(n) ∼ n 2ρ/(1+2ρ) is a uniformly consistent estimator of ξ at a rate greater than or equal to any other uniformly consistent estimator. In other words, the classic Hill estimator is minimax rate optimal in view of Theorem 3.2 wherein ξ n = ξ k(n) satisfies (3.6) for every a(n) with a(n)n ρ/(2ρ+1) → ∞.
Note that Theorem 3.2 also applies to the trimmed Hill estimator. We next show that in the presence of outliers, the trimmed Hill estimator with $k = k(n) \sim n^{2\rho/(1+2\rho)}$ is minimax rate optimal with the same rate as that of the classic Hill estimator. In addition, the minimax rate optimality holds uniformly over all Then, for every sequence $a(n) \downarrow 0$, such that $a(n)\sqrt{k(n)} \to \infty$, we have The proof of this result is given in Section 6.3.1. Observe that $\sqrt{k(n)} \propto n^{\rho/(1+2\rho)}$ is the optimal rate in Theorem 3.2. Therefore, Theorem 3.3 implies that $\hat{\xi}_{k_0,k}(n)$ is minimax rate-optimal in the sense of Hall and Welsh [25]. Also, note that the trimmed Hill estimator $\hat{\xi}_{k_0,k}(n)$ is uniformly consistent with respect to both the family of possible distributions $D$ as well as the trimming parameter $k_0$, provided

Remark 3.4. The above appealing result shows that trimming does not sacrifice the rate of estimation of $\xi$ so long as $k_0 = o(n^{2\rho/(2\rho+1)})$, $n \to \infty$. In the regime where the rate of contamination $k_0$ exceeds $n^{2\rho/(2\rho+1)}$, to achieve robustness and asymptotic consistency, one would have to choose $k(n) \gg n^{2\rho/(2\rho+1)}$, which naturally leads to rate-suboptimal estimators. In this case, similar uniform consistency for the trimmed Hill estimators can be established along the lines of Theorem 3.3.

Asymptotic Normality of the Trimmed Hill Estimator
Here, we shall establish the asymptotic normality of ξ k 0 ,k under the general semi-parametric regime (1.1) or equivalently (3.2). In Proposition 2.6, we already established the asymptotic normality of the trimmed Hill estimator in the Pareto regime. Recalling Relation (3.4), we observe that ξ k 0 ,k (n) differs from a tail index estimator based on Pareto data only by a remainder term R k 0 ,k (n). Thus, proving the asymptotic normality of ξ k 0 ,k (n) amounts to controlling the asymptotic behavior of the remainder term.
Indeed, we begin with a much stronger result which establishes the convergence rate of To this end, following [4], we adopt the following second order condition on the function L: for all ε > 0 and some t ε dependent on ε and g : (0, ∞) → (0, ∞) is a −ρ varying function with ρ ≥ 0 (see Lemma A.2 in [4] for more details.) Theorem 3.5. Suppose the X i 's are independent realizations with tail quantile function Q as in Relation (3.2) with L as in Relation (3.8). If, for some δ > 0 and constant A > 0, The proof is given in Section 6.3.2.
The asymptotic normality of ξ k 0 ,k (n) is a direct consequence of Theorem 3.5 with δ = 1/2 and Relation (2.5). This is formalized in the following corollary.
Remark 3.7. Consider the asymptotic normality result of Corollary 3.6 for the Hall class of distributions in Relation (3.5). In this case, we have g(x) ∝ x −ρ and the convergence √ kg(n/k) → A > 0 implies that k = k(n) ∝ n 2ρ/(2ρ+1) , as n → ∞. This is the optimal rate, which as we know from Theorem 3.3, cannot be achieved by an asymptotically unbiased estimator of ξ. Indeed, the limit distribution in (3.11) involves the bias term cA/(ρ + 1). To eliminate the bias term, one can pick k = o(n 2ρ/(2ρ+1) ), which in this case implies that √ kg(n/k) → A ≡ 0. That is, asymptotically unbiased estimators can be obtained but one needs to sacrifice the optimal rate.

Asymptotic behavior of the Weighted Sequential Testing
In this section, we establish the asymptotic consistency of the weighted sequential testing algorithm under the same set of second order regular variation conditions on the function L as in Section 3.2. We begin with a convergence result on the ratio statistics of Relation (2.6).
Theorem 3.8. Assume that the conditions of Theorem 3.5 hold. Then, for δ > 0 in Relation (3.9), we have where T k 0 ,k (n) and T * k 0 ,k (n) are based on ξ k 0 ,k (n) and ξ * k 0 ,k (n), respectively as in Relation (2.6).
The proof is described in Section 6.3.3.
, the T k 0 ,k (n)'s converge in distribution to ratio statistics of Pareto. Note that the order of k is same as that needed for the asymptotic normality of the trimmed Hill estimator (see Remark 3.7).
We next establish that the weighted sequential testing algorithm is well calibrated and attains the significance level q even for the general class of heavy tailed models in (3.8).
Theorem 3.10. Assume that the conditions of Theorem 3.5 hold for some δ ≥ 1. Then, for δ as in Relation (3.9), we have with U k 0 ,k (n) and U * k 0 ,k (n) based on T k 0 ,k (n) and T * k 0 ,k (n), respectively as in Relation (2.10). Moreover, if the conditions of Theorem 3.5 hold for some δ ≥ 2, then The proof is given in the Section 6.3.3.
Remark 3.11. To illustrate the above result, we consider the Hall class of distribution where g(x) ∼ x −ρ , as x → ∞. In this case, for a given value of δ, the order of k which satisfies (3.9) for A > 0 is given by n ρ/(ρ+δ) . The asymptotic normality and minimax optimal rate for the trimmed Hill statistic, ξ k 0 ,k (n) is obtained for δ = 1/2 (see Remark 3.7). However, for the U k 0 ,k (n)'s to converge and the type I error of the weighted sequential testing algorithm to be controlled, we need δ ≥ 1 and δ ≥ 2, respectively. This would in turn produce suboptimal choices of k in terms of rate. If ρ is large, the difference between these suboptimal values of k and the optimal value n ρ/(ρ+1/2) is negligible. For small values of ρ, the difference is greater and the consistency of the algorithm is compromised. However, in Section 4.5, we show that even for smaller values of ρ, we do a reasonably good job in terms of determining the true number of outliers k 0 .
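The orders of $k$ quoted in this remark follow from a one-line computation. The sketch below assumes that Relation (3.9) is a condition of the form $k^{\delta} g(n/k) \to A$, which is consistent with the convergence $\sqrt{k}\,g(n/k) \to A$ used for $\delta = 1/2$ in Remark 3.7, and takes $g(x) = x^{-\rho}$ exactly for simplicity.

```latex
% Solve k^{\delta} g(n/k) \to A for g(x) = x^{-\rho}:
k^{\delta} \Big(\frac{n}{k}\Big)^{-\rho} = k^{\rho+\delta}\, n^{-\rho} \to A
\quad\Longleftrightarrow\quad
k \sim A^{1/(\rho+\delta)}\, n^{\rho/(\rho+\delta)} .

% Exponent \rho/(\rho+\delta) of n for the three values of \delta in the remark:
\begin{array}{c|ccc}
          & \delta = 1/2 & \delta = 1 & \delta = 2 \\ \hline
\rho = 1  & 2/3          & 1/2        & 1/3        \\
\rho = 10 & 20/21        & 10/11      & 5/6
\end{array}
```

For $\rho = 10$ the three exponents are close (about 0.95, 0.91 and 0.83), whereas for $\rho = 1$ they differ substantially, which illustrates why the loss in rate matters mainly for small $\rho$.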
Performance of the Adaptive Trimmed Hill Estimator

Simulation Set Up
In this section, we study the finite sample performance of the adaptive trimmed Hill estimator $\hat{\xi}_{\hat{k}_0,k}(n)$, which is the trimmed Hill statistic in Relation (2.2) with $k_0 = \hat{k}_0$ (see also https://shrijita-apps.shinyapps.io/adaptive-trimmed-hill/). Here, the value of the trimming parameter $\hat{k}_0$ is obtained from the weighted sequential testing algorithm in Section 2.2. We also evaluate the accuracy of the algorithm as an estimator of the number of outliers $k_0$.
Measures of Performance: The performance of an estimator $\hat{\xi}$ of $\xi$ is evaluated in terms of its root mean squared error ($\sqrt{\mathrm{MSE}}$):
$$\sqrt{\mathrm{MSE}} := \sqrt{E(\hat{\xi} - \xi)^2}. \tag{4.1}$$
Using criterion (4.1), we evaluate the performance of the adaptive trimmed Hill estimator and several other competing estimators of the tail index $\xi$. The computation of the $\sqrt{\mathrm{MSE}}$ is based on 2500 independent Monte Carlo simulations.
Data generating models: We generate $n$ i.i.d. observations from one of the following heavy-tailed distributions:

In Theorem 3.3, we showed that the trimmed Hill estimator is also optimal at the same rate as the classic Hill estimator as long as the number of outliers satisfies $k_0 = o(k)$. Since $\rho = \infty$ for Pareto, the optimal $k$ is $n - 1$, where $n$ is the sample size. In Section 4.2, we demonstrate the performance of the adaptive trimmed Hill estimators in the regime of no outliers. In this scenario, the classic Hill estimator (recall Relation (1.2)) is an asymptotically optimal estimator of $\xi$ (see [27]) and is therefore used as the comparative baseline.
Outlier Scenarios: In Sections 4.3, 4.4 and 4.5, we demonstrate the performance of the adaptive trimmed Hill estimator in the presence of outliers. Outliers are injected into the extreme observations of the data by one of the following mechanisms:
1. Exponentiated Outliers: The top $k_0$ order statistics are perturbed as follows:
2. Scaled Outliers: The top $k_0$ order statistics are perturbed as
$$X_{(n-i+1,n)} := X_{(n-k_0,n)} + C\,\big(X_{(n-i+1,n)} - X_{(n-k_0,n)}\big), \quad i = 1, \dots, k_0, \tag{4.5}$$
Thus, observations above a given threshold $\tau$ are perturbed.
Note that all three types of outliers preserve the order of the bottom $(n - k_0)$ order statistics. The exponentiated and scaled outliers preserve the order of the top $k_0$ order statistics as well. The case of mixed outliers is a challenging one because the trimming parameter $k_0$, though controlled by $\tau$, is random and not well defined. In contrast, $k_0$ is fixed and well defined for exponentiated and scaled outliers. Thus, for exponentiated and scaled outliers, we demonstrate the efficiency of the weighted sequential testing algorithm in determining $k_0$.
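The scaled-outlier mechanism is simple enough to sketch directly. The code below applies $X_{(n-i+1,n)} := X_{(n-k_0,n)} + C\,(X_{(n-i+1,n)} - X_{(n-k_0,n)})$ to the top $k_0$ order statistics; the function name `inject_scaled_outliers` is hypothetical, and the value of $C$ in the usage line is illustrative only.

```python
def inject_scaled_outliers(x, k0, C):
    """Return the sample sorted in decreasing order with its top-k0 values scaled.

    Each of the k0 largest values is moved away from the pivot X_(n-k0,n)
    by the factor C; the remaining n - k0 values are left unchanged.
    """
    if not 0 <= k0 < len(x):
        raise ValueError("need 0 <= k0 < n")
    s = sorted(x, reverse=True)
    pivot = s[k0]                                   # X_(n-k0,n)
    return [pivot + C * (v - pivot) for v in s[:k0]] + s[k0:]


# Usage: the two largest of five values are pushed away from the pivot 3.
print(inject_scaled_outliers([1.0, 2.0, 3.0, 4.0, 5.0], 2, 10.0))
```

For $C > 0$ the perturbed values keep their relative order, in line with the observation that scaled outliers preserve the ordering of the top $k_0$ order statistics.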

Competing Robust Estimators:
In the presence of outliers, the adaptive trimmed Hill estimator is indeed a robust estimator of the tail index $\xi$. Thus, for a comparative baseline we use two other robust estimators of the tail index in Sections 4.3, 4.4 and 4.5. These are the optimal B-robust estimator (OBRE) of [37] and the generalized median estimator (GME) of [13]. Each estimator is indexed by two different asymptotic relative efficiency (ARE) values, viz., 78% and 94%, to allow for varying degrees of robustness. The constant $c$, which serves as a bound on the influence function (IF), controls the degree of robustness of the optimal B-robust estimator (see Relations (2) and (3) in [37]). The values $c = 1.63$ and $c = 2.73$ result in ARE values of 78% and 94%, respectively, for the optimal B-robust estimator. Similarly, the parameter $\kappa$, which determines the subset size in the definition of the generalized median statistic, controls the degree of robustness of the generalized median estimator (see Relation 2.2 in [13]). Indeed, the values $\kappa = 2$ and $\kappa = 5$ produce ARE values of 78% and 94%, respectively, for the generalized median estimator. Other robust estimators of the tail index, like the probability integral transform statistic estimator of [22] and the partial density component estimator of [35], were also considered but their results have been omitted for brevity.

Case of No Outliers
For the three distribution models in Relation (4.2), we report the performance of the adaptive trimmed Hill estimator (ADAP) under the regime of no outliers. The classic Hill estimator (HILL) is used as the comparative baseline (see Figure 3). We observe that for a wide range of $k$, the ADAP is virtually indistinguishable from the HILL irrespective of the distribution under study. This indicates that the weighted sequential testing algorithm can precisely determine $k_0 = 0$ for the same wide range of $k$-values as in Figure 3. Indeed, Table 1 shows that the algorithm attains the nominal significance level of $q = P(\hat{k}_0 > 0) = 0.05$. This encouraging finite sample performance complements the theoretically established consistency of the algorithm in Theorem 3.10.

Adaptive Robustness
In this section, we study how the presence of outliers in the data influences the performance of the adaptive trimmed Hill estimator (ADAP) and the weighted sequential testing algorithm. For clarity and simplicity, the data in this section are generated from Pareto as in Relation (4.2) with σ = 1, ξ = 2 for varying sample sizes n = 100, 300, 500.
The value of k is fixed at n − 1, which is the optimal k for the Pareto regime (see Relation [...]).
The figures also show an intriguing adaptive robustness property of our estimator. Namely, its $\sqrt{\mathrm{MSE}}$ is nearly flat and grows only slowly as the degree of contamination increases (parametrized by either the number of outliers $k_0$ in Figures 4 and 5 or the threshold τ in Figure 6). The competing estimators, on the other hand, break down completely as the degree of contamination increases. The explanation is that the competing estimators must be calibrated to a predefined level of robustness by setting their ARE level in advance. To the best of our knowledge, none of the existing works in the literature provides a data-driven method for selecting this optimal ARE value. In contrast, the trimming parameter $k_0$ of the ADAP is estimated from the data itself, which allows the estimator to adapt to unknown degrees of contamination. Figures 4 and 5 show that whenever the target ARE value is greater than $(1 - k_0/n) \times 100\%$, the performance of the ADAP is far superior to that of the competing estimators. For example, the OBRE-94 and the GME-94 break down completely where $1 - k_0/n \le 0.9$ (n = 100, $k_0 \ge 15$ and n = 300, $k_0 \ge 30$). Similarly, the OBRE-78 and the GME-78 perform very poorly where $1 - k_0/n \le 0.7$ (n = 100, $k_0 \ge 30$). An estimator indexed by a higher ARE value has greater efficiency provided ARE $\le (1 - k_0/n) \times 100\%$. This explains why the OBRE-78 and the GME-78 perform quite poorly in comparison to the OBRE-94 and the GME-94 where $1 - k_0/n \ge 0.95$ (n = 100, $k_0 \le 5$ and n = 300, $k_0 \le 15$). By automatically estimating the number of outliers, the ADAP not only produces an estimator of ξ that is robust to varying levels of data contamination but also provides a methodology for outlier detection in the extremes of heavy-tailed models.
Indeed, Tables 2 and 3, which report the mean and standard error of $\hat{k}_0$ for outliers injected by mechanisms (4.4) and (4.5), show that for all values of n, the weighted sequential testing algorithm recovers the true number of outliers $k_0$ for almost all values of $k_0$ (the exception is $k_0 = 2$ for scaled outliers).

Table 3: $E(\hat{k}_0)$ ± Standard Error($\hat{k}_0$) for Pareto(1,2) with scaled outliers, C = 200.
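The effect of trimming under contamination can be sketched in Python as follows. The sketch assumes that the trimmed Hill statistic has the form $\hat{\xi}_{k_0,k} = \frac{k_0}{k-k_0}\log\frac{X_{(n-k_0,n)}}{X_{(n-k,n)}} + \frac{1}{k-k_0}\sum_{i=k_0+1}^{k}\log\frac{X_{(n-i+1,n)}}{X_{(n-k,n)}}$, which should be checked against Relation (2.2). The contamination step (multiplying the top $k_0$ order statistics by C) is a simplified stand-in and not necessarily the mechanism of Relation (4.5); the function name and seed are likewise hypothetical.

```python
import numpy as np

def trimmed_hill(x, k0, k):
    """Trimmed Hill statistic (assumed form); k0 = 0 gives the classic Hill estimator."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if not 0 <= k0 < k <= n - 1:
        raise ValueError("need 0 <= k0 < k <= n - 1")
    base = x[n - k - 1]          # X_(n-k,n)
    kept = x[n - k:n - k0]       # X_(n-k+1,n), ..., X_(n-k0,n)
    boundary = x[n - k0 - 1]     # X_(n-k0,n); its term vanishes when k0 = 0
    total = k0 * np.log(boundary / base) + np.sum(np.log(kept / base))
    return float(total / (k - k0))

# Hypothetical experiment: Pareto(1, 2) data with the top k0 values inflated by C.
rng = np.random.default_rng(1)
n, k0, C, xi = 500, 10, 200.0, 2.0
clean = np.sort((1.0 - rng.uniform(size=n)) ** (-xi))
contaminated = clean.copy()
contaminated[n - k0:] *= C       # simplified stand-in for scaled outliers

k = n - 1
hill_clean = trimmed_hill(clean, 0, k)
hill_bad = trimmed_hill(contaminated, 0, k)
trim_clean = trimmed_hill(clean, k0, k)
trim_bad = trimmed_hill(contaminated, k0, k)
```

Because the trimmed statistic never uses the top $k_0$ values, `trim_bad` coincides with `trim_clean` in this sketch, whereas the classic Hill estimate is shifted upward by $k_0 \log(C)/k$.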

Impact of Outlier Severity and Tail Index.
In this section, we study the influence of the magnitude of the outliers and of the tail index on the performance of the adaptive trimmed Hill estimator (ADAP) for Pareto observations with sample size n = 500. The conclusions were similar for the other heavy-tailed models explored. We begin with the impact of outlier severity on the performance of the ADAP. For the outlier generating mechanisms in Section 4.1, the outlier severity is controlled by the parameters L, C and M. The data generating model is Pareto as in Relation (4.2) with σ = 1 and ξ = 2. Figure 7 plots the $\sqrt{\mathrm{MSE}}$ of the ADAP for the outlier generating mechanisms in Relations (4.4), (4.5) and (4.6) with $k_0 = 10$, τ = 5000 and varying L, C and M. For comparison, the $\sqrt{\mathrm{MSE}}$ values of the optimal B-robust estimator (OBRE) and the generalized median estimator (GME) at the 78% and 94% ARE levels are also included. The ADAP performs better than both the OBRE and the GME for almost all values of L, C and M, regardless of their ARE levels. The only exception is C = 10 in the case of scaled outliers (see Relation (4.5)). Though more robust, the OBRE-78 and the GME-78 perform poorly at lower levels of contamination in the data. This explains their inferior behavior at n = 500, $k_0 = 10$, where the degree of contamination is only $k_0/n = 10/500 = 2\%$.

Table 4: $E(\hat{k}_0)$ ± Standard Error($\hat{k}_0$) for Pareto(1,2) with $k_0 = 10$ outliers.
The superiority of the ADAP grows as the magnitude of the outliers increases. For exponentiated and scaled outliers, the increase in magnitude is manifested through increasing values of |log(L)| and |log(C)|, respectively.⁴ For mixed outliers, the magnitude increases with the value of M. As the magnitude increases, the weighted sequential testing algorithm is better able to detect the true number of outliers $k_0$ (see Table 4), which explains the greater efficiency of the ADAP.
We next study the impact of the tail index ξ on the performance of the ADAP. The data generating model is Pareto as in Relation (4.2) with σ = 1 and varying values of ξ. Outliers are injected according to Relations (4.4), (4.5) and (4.6) with $k_0 = 10$, τ = 5000, L = 3, C = 200 and M . Figure 8 plots the $\sqrt{\mathrm{MSE}}$ values of the ADAP along with those of the OBRE and the GME at the 78% and 94% ARE levels. The performance of the ADAP is superior to that of the remaining estimators. For exponentiated and mixed outliers, the improvement is even more prominent at larger values of ξ. This is because, for the same values of L and M, the severity of outliers is greater for heavier tails (ξ = 2.5) than for lighter ones (ξ = 0.5). In contrast, for scaled outliers, the improvement is more prominent at smaller values of ξ. This is because, for the same value of C, the severity of outliers is greater for lighter tails than for heavier ones. These observations are consistent with Table 5, where the accuracy of the weighted sequential testing algorithm in estimating the true number of outliers improves with increasing ξ for exponentiated and mixed outliers and deteriorates with increasing ξ for scaled outliers.

10.00 ± 0.59 10.01 ± 1.18 9.98 ± 0.72 9.99 ± 1.05 9.79 ± 1.47
Table 5: $E(\hat{k}_0)$ ± Standard Error($\hat{k}_0$) for Pareto(1,ξ) with $k_0 = 10$ outliers for L = 3 and C = 200.
Due to their slow rate of convergence to Pareto tails, both the Burr and |T| distributions are difficult cases to analyze. For the Burr distribution with ρ = 1, the rate of convergence is even slower than that of the |T| distribution with ρ = 2. However, the ADAP performs well even in this challenging regime. This can be attributed to the accuracy of the weighted sequential testing algorithm, which correctly identifies the true number of outliers $k_0$ irrespective of the distribution under study for a wide range of k-values (see Tables 6 and 7).

Application
In this section, we apply our weighted sequential testing algorithm and adaptive trimmed Hill estimator to real data. Two data sets are explored. The first records the calcium content of soil samples in the Condroz region of Belgium [34] (also analyzed in https://shrijitaapps.shinyapps.io/adaptive-trimmed-hill/). This data set is heavy tailed and has already been explored in [5] and [36]. The second data set involves insurance claim settlements [10]. Analysis of both data sets reveals the presence of outliers in the extremes, which makes them suitable for the application of our methodology.

Condroz Data set
The Condroz data set of [34] measures the calcium content of soil samples together with their pH levels in the Condroz region of Belgium. As in [36], we consider the conditional distribution of the calcium content for pH levels lying between 7 and 7.5. The left and middle panels of Figure 1 use the value k = 85, based on the k opt value from [36]. The left panel displays a Pareto quantile plot [5] of the data, where an apparent linear trend indicates Pareto distributed observations. Nearly six data points show up as outliers in the Pareto quantile plot. This has already been observed in [34], but no principled methodology for the identification of such outliers has been proposed. The diagnostic plot of our trimmed Hill estimator (recall Relation (2.2)) in the middle panel also shows a change point in the values of the trimmed Hill statistics at $k_0 = 6$. On applying the weighted sequential testing algorithm with type I error q = 0.05, we formally identify exactly $k_0 = 6$ outliers for this data set.⁶ This is consistent with the findings of [34] and [36].
The right panel in Figure 1 displays the values of the trimmed Hill estimator as a function of k for $k_0 = \hat{k}_0 = 6$. Also displayed as functions of k are the classic Hill and biased Hill estimators with $k_0 = 6$ (recall Relations (1.2) and (1.4)). The robust estimator of ξ reported in the analysis of [36] is the same as the biased Hill estimator. Compared with the trimmed Hill plot, the classic Hill plot produces much larger estimates and the biased Hill plot produces much smaller estimates of the tail index ξ. This can be explained by the apparent upward trend in the outliers shown in the left and middle panels of Figure 1. Thus, ignoring the presence of outliers, either by using the classic Hill estimator or by naively truncating them and using the biased Hill statistics, can lead to large discrepancies in the tail index values. The trimmed Hill estimator with $k_0 = 6$, which is in fact our adaptive trimmed Hill estimator discussed in Section 4.1, produces more credible estimates of the tail index ξ.

French Claims Data Set
Next, we consider a data set of claim settlements issued by a private insurer in France for the period 1996-2006 from [10]. We investigate the payments of claim settlements for the year 2006. Figure 11 shows exploratory plots of this data, where the left and middle panels use the value k = 130. The left panel displays a Pareto quantile plot [5] of the data, where an apparent linear trend indicates Pareto distributed observations as well as a large number of outliers. Nearly thirty-three data points show up as outliers in the Pareto quantile plot. This is further confirmed by the diagnostic plot in the middle panel, where a change point in the values of the trimmed Hill statistics is evident at $k_0 \approx 33$. On applying the weighted sequential testing algorithm with q = 0.05, we identify $k_0 \approx 33$ outliers for this data set.⁶
In contrast to the Condroz data set (Figure 1, right panel), both the classic and biased Hill plots now lie under the trimmed Hill plot (see the right panel of Figure 11, constructed with $k_0 = 33$ and varying k). This can be explained by the apparent downward trend in the outliers shown in the left and middle panels of Figure 11.
Observe that the trimmed Hill plot in Figure 11 (right panel) has a rather high peak for k close to $k_0$, but then quickly stabilizes around the value of 2 as k grows. It is well known that, except in the ideal Pareto setting, the classic Hill plot can be quite volatile for small values of k (see Figure 4.2 in [33]). The same holds for the trimmed Hill plots, but ultimately, for a wide range of k in Figure 11, the trimmed Hill plot is relatively stable and provides more reliable estimates of ξ than the classic and biased Hill plots therein. This simple analysis shows that ignoring or not adequately treating extreme outliers can lead to significant underestimation of the tail index ξ. This in turn can result in severe underestimation of the tail of the loss distribution, with detrimental effects on the insurance industry.

Appendix

Auxiliary Lemmas
Lemma 6.1. Let $E_j$, $j = 1, 2, \cdots, n+1$, be i.i.d. standard exponential (Exp(1)) random variables. Then, the Gamma(i, 1) random variables defined as
$$\Gamma_i = \sum_{j=1}^{i} E_j, \quad i = 1, \cdots, n+1,$$
satisfy
$$\left(\frac{\Gamma_1}{\Gamma_{n+1}}, \cdots, \frac{\Gamma_n}{\Gamma_{n+1}}\right) \ \text{and} \ \Gamma_{n+1} \ \text{are independent} \qquad (6.2)$$
and
$$\left(\frac{\Gamma_1}{\Gamma_{n+1}}, \cdots, \frac{\Gamma_n}{\Gamma_{n+1}}\right) \overset{d}{=} \left(U_{(1,n)}, \cdots, U_{(n,n)}\right),$$
where $U_{(1,n)} < \cdots < U_{(n,n)}$ are the order statistics of n i.i.d. U(0,1) random variables.
For details on the proof, see Example 4.6 on page 44 in [3]. The next result, quoted from page 37 in [16], will be used throughout the paper to switch between order statistics of exponentials and i.i.d. exponential random variables.
Lemma 6.2 (Rényi, 1953). Let $E_1, E_2, \cdots, E_n$ be a sample of n i.i.d. exponential random variables with mean ξ (denoted by Exp(ξ)) and let $E_{(1,n)} \le E_{(2,n)} \le \cdots \le E_{(n,n)}$ be the order statistics. By Rényi's (1953) representation, we have for fixed k ≤ n,
$$\left(E_{(1,n)}, \cdots, E_{(k,n)}\right) \overset{d}{=} \left(\frac{E^*_1}{n}, \ \frac{E^*_1}{n} + \frac{E^*_2}{n-1}, \ \cdots, \ \sum_{j=1}^{k} \frac{E^*_j}{n-j+1}\right), \qquad (6.4)$$
where $E^*_1, \cdots, E^*_k$ are also i.i.d. Exp(ξ).
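A quick Monte Carlo check of Rényi's representation can be sketched in Python. The sample sizes, seed and variable names below are illustrative assumptions; the check compares the empirical means of the first k exponential order statistics, and of the corresponding weighted partial sums, with the exact values $\xi \sum_{j=1}^{i} 1/(n-j+1)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, xi, reps = 20, 5, 2.0, 20000   # hypothetical settings for the check

# Left-hand side: the k smallest order statistics of n i.i.d. Exp(xi) draws.
samples = rng.exponential(scale=xi, size=(reps, n))
lhs = np.sort(samples, axis=1)[:, :k]

# Right-hand side: partial sums of independent Exp(xi) variables,
# the j-th one scaled by 1 / (n - j + 1).
weights = 1.0 / (n - np.arange(1, k + 1) + 1)
rhs = np.cumsum(rng.exponential(scale=xi, size=(reps, k)) * weights, axis=1)

# Exact means implied by the representation: xi * sum_{j <= i} 1 / (n - j + 1).
exact_means = xi * np.cumsum(weights)
lhs_means = lhs.mean(axis=0)
rhs_means = rhs.mean(axis=0)
```

Agreement of the two sets of empirical means with `exact_means` only checks first moments; it is a sanity check of the representation rather than a proof of the distributional identity.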
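Lemma 6.5. If $E_i$, $i = 1, \cdots, n$, are i.i.d. observations from Exp(ξ), the best linear unbiased estimator (BLUE) of ξ based on the order statistics $E_{(1,n)} < \cdots < E_{(r,n)}$ is given by
$$\hat{\xi} = \frac{1}{r}\sum_{i=1}^{r} E_{(i,n)} + \frac{n-r}{r}\, E_{(r,n)}.$$
Proof. Let $\hat{\xi} = \sum_{i=1}^{r} \gamma_i E_{(i,n)}$ denote the BLUE of ξ. By Relation (6.4), the BLUE can then be expressed as
$$\hat{\xi} \overset{d}{=} \sum_{i=1}^{r} \gamma_i \sum_{j=1}^{i} \frac{E^*_j}{n-j+1} = \sum_{j=1}^{r} \delta_j E^*_j,$$
where the $E^*_j$ are i.i.d. from Exp(ξ) and $\delta_j = \frac{1}{n-j+1}\sum_{i=j}^{r} \gamma_i$. For i.i.d. observations from Exp(ξ), the sample mean is the uniformly minimum variance unbiased estimator (UMVUE) of ξ (see the Lehmann-Scheffé theorem, Theorem 1.11, page 88 in [30]).
Thus, $\delta_j = 1/r$ yields the required best linear unbiased estimator, and therefore the weights $\gamma_i$ have the form:
$$\gamma_i = \frac{1}{r}, \quad i = 1, \cdots, r-1, \qquad \gamma_r = \frac{n-r+1}{r}.$$
This completes the proof.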
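The weight computation in the proof of Lemma 6.5 can be verified with exact rational arithmetic. The sketch below assumes the standard censored-sample weights $\gamma_i = 1/r$ for $i < r$ and $\gamma_r = (n-r+1)/r$, and reads $\delta_j$ as $\frac{1}{n-j+1}\sum_{i=j}^{r}\gamma_i$; the function names and the values n = 10, r = 4 are hypothetical.

```python
from fractions import Fraction

def blue_weights(n, r):
    """Assumed BLUE weights on E_(1,n), ..., E_(r,n): 1/r each, (n-r+1)/r on the last."""
    return [Fraction(1, r)] * (r - 1) + [Fraction(n - r + 1, r)]

def renyi_coefficients(gammas, n):
    """Coefficients delta_j = (sum_{i=j}^{r} gamma_i) / (n - j + 1) of the E*_j."""
    r = len(gammas)
    return [
        sum(gammas[j - 1:], Fraction(0)) / (n - j + 1)
        for j in range(1, r + 1)
    ]

n, r = 10, 4                       # hypothetical sample size and censoring level
gammas = blue_weights(n, r)
deltas = renyi_coefficients(gammas, n)
# Every delta_j equals 1/r, so the estimator is the sample mean of E*_1, ..., E*_r.
all_equal_one_over_r = all(d == Fraction(1, r) for d in deltas)
```

Using `Fraction` keeps the check exact, so equality with 1/r can be tested directly instead of up to a floating-point tolerance.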
Proof of Theorem 2.5. Assume that σ is known and consider the class of statistics: Since σ is no longer a parameter, every statistic in U σ k 0 can be equivalently written as a function of log(X (n−i+1,n) /σ), i = k 0 + 1, · · · , n as follows: ∼ Pareto(σ, ξ) .
Since X i 's follow Pareto(σ, ξ), log(X i /σ) ∼ Exp(ξ) and therefore log X (n−k 0 ,n) σ , · · · , log where E (1,n) ≤ · · · ≤ E (n,n) are the order statistics of n i.i.d. observations from Exp(ξ). Therefore where the E i 's do not depend on σ. Next, using Relation (6.4) of Lemma 6.2, we have Using the above result with (6.16), we get ∼ Exp(ξ) . (6.17) where the first equality is in the sense of finite dimensional distributions. By (6.17), we have inf T ∈U σ k 0 is uniformly the minimum variance estimator (UMVUE) of ξ among the class described by V k 0 , L can be easily obtained as The fact that E * n−k 0 is the UMVUE follows because it is an unbiased and complete sufficient statistic for ξ (see Lehmann Scheffe Theorem, Theorem 1.11, page 88 in [30]).
To complete the proof, observe that every statistic T in $U_{k_0}$ is an unbiased estimator of ξ for any choice of σ. This implies that $T \in U^{\sigma}_{k_0}$ for any σ, and therefore L ≤ Var(T). Since this holds for all $T \in U_{k_0}$, the lower bound in (2.3) follows.
For the upper bound in (2.3), we observe that ξ k 0 ,n−1 ∈ U k 0 , which in view of Proposition 2.6 implies inf T ∈U k 0 This completes the proof.
The proof of Relation (2.5) is a direct application of the central limit theorem to Relation (2.4).

Minimax Rate Optimality
Our goal is to establish the uniform consistency in Relation (3.7). To this end, recall the representation in Relation (3.4). For the Hall class of distributions in Relation (3.5), it can be shown that $\sqrt{k}\, R_{k_0,k}$ is $O_P(1)$ (see Lemma 6.6 below). With $\sqrt{k}\,|R_{k_0,k}|$ bounded in probability, it is easier to bound the quantity [...] This shall form the basis of the proof of Theorem 3.3, as shown next.
This completes the proof.

Asymptotic Normality
Proof of Theorem 3.5. To prove Relation (3.10), we observe that with S k 0 ,k defined as 29) where Y i 's are i.i.d observations from Pareto(1,1) as in (3.4).
We will show that the right hand side of (6.28) vanishes as k → ∞. To this end, we first show that k δ max 0≤k 0 <h(k) |R k 0 ,k − S k 0 ,k | P −→ 0 as follows: Additionally, ∆ 2k P −→ 0 by Lemma 6.7 and follows from Relation (6.37) and assumption (3.9). Thus, the bound in (6.30) goes to 0 as k → ∞.
Next we show that the second term in the right hand side of (6.28) also vanishes. Indeed, where k δ g(Y (n−k,n) ) P −→ A as in (6.31) and 1 − h(k)/k → 1.
where R k 0 ,k and S k 0 ,k are defined in Relations (3.4) and (6.29), respectively.
Therefore, over the event From Relation (6.10), we have Y (n−k,n) which completes the proof.
Case ρ = 0: As in the previous case, over the event {Y (n−k,n) > t ε }, by Relation (3.8) we have Since Y (n−i+1,n) ≥ Y (n−k 0 ,n) for i = 1, · · · , k 0 + 1, we further obtain over the event {Y (n−k,n) > t ε }. The upper bound in (6.35) can be bounded by 2ε over the event We have already proved that P(Y (n−k,n) > t ε ) → 1. Thus, to complete the proof of Relation (6.32), it only remains to show that ) ε < 2 → 1. (6.36) In this direction, from Relation (6.10), we observe that where the last convergence follows from weak law of large numbers. Thus, Relation (6.36) holds as long as ε < 0.5.
This completes the proof for ρ = 0.
where the last convergence is a direct consequence of Relation (6.6). Finally to prove Relation (6.49), we shall equivalently show that for every subsequence {k l }, there exists a further subsequence k such that which equivalently implies Note that both the sequences a k and b k converge to as k → ∞. Thereby, taking limsup w.r.t k on both sides of Relation (6.53), we get lim sup k→∞ k δ−1 max 0≤k 0 <h( k) Since Relation (6.54) holds for all > 0 and ω ∈ Ω with P(Ω) = 1, we have This entails the proof of the convergence in probability of Relation (6.49). Proof of Relation (3.15). To this end, we show that P H 0 ( k 0 = 0) → 1 − q. We first provide an upper bound on P H 0 ( k 0 = 0) as follows.