1 Introduction

Extreme value analysis of heavy-tailed distributions is important for various applications. In seismology and climatology, for example, statistics of extremes are used to study earthquakes (Beirlant et al. 2018) or heavy precipitation (Carreau et al. 2017). Another important field of research is analysing high financial losses, which becomes particularly interesting if the losses depend on covariates (Chavez-Demoulin et al. 2016; Hambuckers et al. 2018).

To investigate the behaviour of heavy tails, we consider random variables from the max-domain of attraction (DoA) of a Fréchet distribution. Let \(X_{1},\dots ,X_{n}\) be independent and identically distributed (i.i.d.) random variables with a distribution function F, where F is in the DoA of an extreme value distribution (evd) \(G_{\gamma}\) with extreme value index γ > 0. This means that there exist sequences \(a_{n} > 0\) and \(b_{n} \in \mathbb{R}\) such that

$$ \begin{array}{@{}rcl@{}} \lim_{n\rightarrow\infty} F^{n}(a_{n} x+b_{n}) = G_{\gamma}(x) := \exp\left( -x^{-1/\gamma}\right), \quad x>0. \end{array} $$

In this situation the following first-order condition holds,

$$ \begin{array}{@{}rcl@{}} \underset{t\rightarrow\infty}{\lim}\ \frac{1-F(tx)}{1-F(t)}=x^{-1/\gamma},\quad x>0, \end{array} $$
(1)

i.e. the survival function 1 − F is regularly varying with index − 1/γ. Distributions fulfilling this condition are called Pareto-type distributions, because they differ from the Pareto distribution only by a slowly varying function \(\ell_{F}\), i.e. \(1 - F(x) = x^{-1/\gamma}\ell_{F}(x)\).

We can interpret the quotient in Eq. 1 as a conditional probability, from which it follows directly that

$$ \begin{array}{@{}rcl@{}} \frac{X_{1}}{t} \Big| X_{1}>t\ \overset{\mathcal{D}}{\longrightarrow}\ & P, \text{ as } t\rightarrow\infty \text{ and } P\sim \text{Pareto}\left( 1,\frac{1}{\gamma}\right), \\ \log\left( \frac{X_{1}}{t}\right)\Big| X_{1}>t\ \overset{\mathcal{D}}{\longrightarrow}\ & E, \text{ as } t\rightarrow\infty \text{ and } E\sim \text{Exp}\left( \frac{1}{\gamma}\right), \end{array} $$
(2)

where \(\text {Pareto}\left (1,1/\gamma \right )\) denotes the Pareto distribution with scale parameter 1 and shape parameter 1/γ. Thus, for a sufficiently large threshold t the data above this threshold can be modelled by a Pareto or an exponential distribution. In this article we concentrate on the exponential approximation and utilize it for inference on the extreme value index. It is common to consider the threshold \(t = X_{(n-k,n)}\) and choose the sample fraction k instead of t, where \(X_{(1,n)} \leq \dots \leq X_{(n,n)}\) denote the order statistics of a sample of size n. In this case, a natural estimator for γ under the exponential approximation of the log-spacings \(Y_{(i,k)}:=\log (X_{(n-i+1,n)})-\log (X_{(n-k,n)})\) is their mean, the Hill estimator (Hill 1975),

$$ \hat{\gamma}_{k} := \frac{1}{k}\sum\limits_{i=1}^{k} \log\left( \frac{X_{(n-i+1,n)}}{X_{(n-k,n)}} \right) = \frac{1}{k}\sum\limits_{i=1}^{k} Y_{(i,k)}. $$
(3)
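To make the estimator concrete, the following R sketch (our illustration, not code from the paper) computes the Hill path \(\hat{\gamma}_{k}\) for all k = 1,…,n − 1; the Fréchet(2) sampler is only an example.

```r
# Sketch of the Hill estimator in Eq. 3, evaluated for all k = 1, ..., n-1.
hill_all <- function(x) {
  ls <- sort(log(x), decreasing = TRUE)  # descending log order statistics
  k  <- seq_len(length(ls) - 1)
  cumsum(ls)[k] / k - ls[k + 1]          # mean of the log-spacings Y_(i,k)
}

set.seed(1)
x <- (-log(runif(500)))^(-1 / 2)         # Fréchet(2) sample, true gamma = 0.5
plot(hill_all(x), type = "l", xlab = "k", ylab = "Hill estimate")  # Hill plot
```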

The Hill estimator is still among the most popular and well-known estimators for the extreme value index, although its sample path as a function of k can be highly unstable, so that estimation crucially depends on the choice of the sample fraction k. This dependence highlights the difficulties in estimating γ in the peaks-over-threshold approach. Selecting a threshold above which the data can be used for statistical inference in the tail is one of the most fundamental problems in the field of extreme value analysis.

Due to the importance of this task, an appropriate choice of the threshold has been discussed extensively in extreme value research over the last decades and suggested solutions cover a variety of methodologies. We give a short summary of different types of approaches and stress the specific difficulties that arise. We mainly concentrate on the methods considered in the simulation study in Section 4. More comprehensive reviews about threshold selection can be found in Scarrott and MacDonald (2012) and Dey and Yan (2016).

One basic concept in threshold selection is data visualisation, which is discussed in more depth in Kratz and Resnick (1996) and Drees et al. (2000). Popular graphical diagnostics used in this context are the Zipf plot, Hill plot, QQ-plot or the mean-excess plot, to name a few. A major drawback of these methods is their subjectivity due to the necessarily personal interpretation of the plot. Further, it is a burden to choose each threshold manually, especially in high-dimensional settings or when analysing many samples. Easier ways to select the sample fraction are rule-of-thumb approaches such as using the upper 10% of the data (DuMouchel 1983) or \(k=\sqrt {n}\) (Ferreira et al. 2003). Reiss and Thomas (2007) present a procedure that tries to find a region of stability among the estimates of the extreme value index. Their method depends on a tuning parameter, whose choice is further analysed in Neves and Fraga Alves (2004). To our knowledge, no theoretical analysis exists for this approach.

Besides these, and similar, heuristic approaches, there is a class of theoretically motivated procedures that target the optimal sample fraction for specific estimation tasks, such as quantile estimation (Ferreira et al. 2003), estimation of high probabilities (Hall and Weissman 1997) or the Hill estimator, see below. Further, some suggestions compare the empirical distribution to the fitted generalized Pareto distribution (GPD) via goodness-of-fit tests (Northrop and Coleman 2014; Wadsworth 2016; Bader et al. 2018) or by minimizing the distance between them (Pickands 1975; Gonzalo and Olmo 2004; Clauset et al. 2009), where the latter approach is theoretically analysed in Drees et al. (2018). Alternatively, Bayesian procedures can be considered for threshold selection, e.g., the cross-validation approach suggested in Northrop et al. (2017) in the context of ocean storm severity.

Of particular interest to us are methods that aim to estimate the sample fraction \(k_{\text {opt}} = \arg \min \limits _{k>0}\mathbb {E}(\hat {\gamma }_{k}-\gamma )^{2}\) which minimizes the asymptotic mean squared error (AMSE) of the Hill estimator. To construct an estimator for kopt, Drees and Kaufmann (1998) utilize the Lepskii method and an upper bound on the maximum random fluctuation of \(\hat {\gamma }_{k}\) around γ. To apply their approach it is necessary to choose several tuning parameters and to obtain consistent initial estimates for γ and a second-order parameter ρ. In Guillou and Hall (2001), a test statistic is constructed based on an accumulation of log-spacings, which is tested against a critical value that has to be chosen. Danielsson et al. (2001) introduce a double bootstrap approach to estimate the optimal sample fraction. They need to choose the number of bootstrap samples and a parameter \(n_{1}\). For \(n_{1}\), a data-driven, but computationally expensive selection method is provided, where the whole bootstrap procedure is repeated for various possible values of \(n_{1}\). Another estimator for kopt is given in Beirlant et al. (2002), which employs least squares estimates from an exponential regression approach. The method depends on an estimate for ρ and a sample fraction k0. A different approach is taken in Goegebeur et al. (2008), where the properties of kernel test statistics regarding the estimation of bias are used in order to construct an estimator for the AMSE/γ and minimize it with respect to k. Here the choice or estimation of the second-order parameter ρ is needed.

In this paper, we contribute to the problem of threshold selection by introducing two new and easily applicable methods, which do not require the user to manually choose any tuning parameters. The first method, presented in Section 2, is inspired by the idea of testing the exponential approximation. We estimate the integrated square error (ISE) of the exponential density under the assumption that the log-spacings are indeed exponentially distributed. The error functional we obtain, called the inverse Hill statistic (IHS), is very easy to compute and does not depend on any tuning parameters. Since this criterion is highly variable for small k, it can additionally be smoothed to improve the performance. The minimizing oracle sample fraction of IHS is asymptotically smaller than kopt, as it is stricter against deviations from the exponential approximation. However, this estimator performs remarkably well in small samples, as illustrated in our simulation study.

In the second approach, we suggest a smooth estimator for the AMSE of the Hill estimator, called SAMSEE (smooth AMSE estimator). This estimator is based on a preliminary estimate of γ and an estimate of the bias of the Hill estimator. For the preliminary estimate of γ we use the generalized Jackknife approach of Gomes et al. (2000). By minimizing SAMSEE we estimate the optimal sample fraction kopt. For estimation, the choice of a large sample fraction K is necessary, for which we present a data-driven selection procedure in Section 3. SAMSEE utilizes the idea of fixing ρ = − 1, which is justified by good performance in simulations and leads to a simpler and more robust estimator than employing an estimate \(\hat {\rho }\).

After introducing the two novel threshold selection methods in Sections 2 and 3 we compare these methods to various other approaches in a numerical analysis in Section 4. In Section 5 the importance of automated threshold selection procedures is illustrated in an application, where we non-parametrically estimate an extreme value index that varies over time. This example illustrates how the new methods enable the selection of a threshold that depends on covariates.

The proof of Theorem 3, which describes the asymptotic behaviour of the bias estimator, can be found in Appendix A, together with the auxiliary results and additional results from our simulation study.

2 Inverse Hill statistic

In this section, we introduce the first threshold selection procedure by analysing the integrated square error (ISE) between the exponential density hγ and its parametric estimator \(h_{\hat {\gamma }_{k}}\) employing the Hill estimator,

$$ \text{ISE}(k):= \int \left\{h_{\gamma}(x) -h_{\hat{\gamma}_{k}}(x)\right\}^{2}\mathrm{d}x =\frac{1}{2\gamma}-\frac{2}{\gamma +\hat{\gamma}_{k}} + \frac{1}{2\hat{\gamma}_{k}} , $$

where \(h_{\gamma }(x):=\frac {1}{\gamma }e^{-x/\gamma }\) and \(\hat {\gamma }_{k}\) is the Hill estimator. The first term of ISE is constant and thus plays no role for selecting k. The last term of ISE is known, but the second term is not. Therefore, we cannot minimize ISE directly. Instead, we want to estimate and minimize its expectation under the exponential approximation. This is based on the idea of considering the hypothesis H0 that the log-spacings Y(i,k) are indeed exponentially distributed. Under H0 the Hill estimator, as a scaled sum of independent exponential random variables, is gamma distributed, and the mean of ISE (MISE) can be calculated explicitly. We observe that MISE is a decreasing function in k under the exponential approximation,

$$ \text{MISE}(k)-\frac{1}{2\gamma}:= \mathbb{E}_{H_{0}}[\text{ISE}(k)]-\frac{1}{2\gamma} =-\frac{1}{\gamma}C(k)+\frac{k}{2(k-1)\gamma} , $$
(4)

where \(C(k):=2\exp (k) k^{k} {\Gamma }(1-k,k)\) and \({\Gamma }(a,b):={\int \limits }_{b}^{\infty } t^{a-1}e^{-t} \mathrm {d}t\) denotes the upper incomplete gamma function. One can show that C(k) ≈ 1 + 1/(4k), i.e. it converges to 1 very fast, so that we obtain

$$ \lim_{k\rightarrow\infty} \mathbb{E}_{H_{0}}\left[ \frac{2}{\gamma+\hat{\gamma}_{k}}\right] = \frac{1}{\gamma} = \mathbb{E}_{H_{0}}\left[\frac{k-1}{k\hat{\gamma}_{k}}\right] . $$

This provides an estimator for the first term in Eq. 4 under H0. However, owing to its high variability for small k, we instead want to find an estimator of the form \(w/\hat {\gamma }_{k}\) for some w depending on k that minimizes the MSE of that estimator under the exponential approximation. To do so, we approximate its MSE in the following way,

$$ \begin{array}{@{}rcl@{}} \mathbb{E}_{H_{0}}\left[\left( \frac{w}{\hat{\gamma}_{k}}-\frac{2}{\hat{\gamma}_{k}+\gamma}\right)^{2}\right] \approx \frac{w^{2}k^{2}}{\gamma^{2}(k-1)(k-2)} -\frac{2wk}{\gamma^{2}(k-1)} + \frac{1}{\gamma^{2}}. \end{array} $$
(5)

The approximation involves functions similar to C(k), which quickly become constant. The MSE in Eq. 5 is minimized for w = (k − 2)/k. Thus, we suggest the inverse Hill statistic

$$ \begin{array}{@{}rcl@{}} \text{IHS}(k) := \frac{1}{2\hat{\gamma}_{k}} -\frac{k-2}{\hat{\gamma}_{k} k} = \frac{4-k}{2\hat{\gamma}_{k} k} \end{array} $$

to estimate \(\text{MISE}(k) - (2\gamma)^{-1}\), and we select the sample fraction by minimizing IHS, \(\hat {k}_{\text {IHS}} := \arg \min \limits _{1<k<n}\{ \text {IHS}(k)\}\). By minimizing IHS we select a sample fraction where IHS starts increasing and thereby contradicts the behaviour of MISE under H0. This criterion can be compared to hypothesis testing with a high significance level α, which implies seeking high confidence when deciding not to reject H0. Further properties of \(\hat {k}_{\text {IHS}}\) are analysed theoretically in Section 2.1 and for finite samples in a numerical study in Section 4.
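A minimal R sketch of this selection rule (our own illustration; hill_all() refers to the sketch in Section 1):

```r
# Sketch of the IHS criterion and of k_hat_IHS; gamma_hat is the Hill path,
# e.g. hill_all(x) from the sketch in Section 1.
ihs_select <- function(gamma_hat) {
  k     <- seq_along(gamma_hat)
  ihs   <- (4 - k) / (2 * gamma_hat * k)   # IHS(k)
  k_hat <- which.min(ihs[-1]) + 1          # arg min over 1 < k < n
  list(k_hat = k_hat, gamma_hat = gamma_hat[k_hat], ihs = ihs)
}
```

For the negative-bias variant IHS⁻ discussed below, only the sign in the numerator changes, i.e. (4 + k) instead of (4 − k).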

Figure 1 illustrates that IHS exhibits high fluctuations for small k, which might make the automated selection of the minimum highly variable. To control this problematic behaviour we smooth IHS. More specifically, we want to estimate \(\mathbb {E}[\text {IHS}]\) by considering the regression problem

$$ \begin{array}{@{}rcl@{}} \text{IHS}(k) = \mathbb{E}[\text{IHS}(k)] + \sigma\epsilon_{k},\ \ k=1,\dots,n, \end{array} $$

where σ > 0 and \(\mathbb {E}[\epsilon _{k}]=0\). Due to the structure of the Hill estimator, the random variables 𝜖k are highly dependent, which needs to be taken into account in estimation. In our simulations, we apply a Bayesian non-parametric procedure introduced by Serra et al. (2018) to obtain a smooth estimator for the expectation of IHS – denoted as sIHS – exhibiting less variation for small k, as illustrated in Fig. 1, and thereby improving automatic threshold selection. The approach implemented in the eBsc R-package is completely data-driven and non-parametric: both the mean and the covariance structure of the data are estimated non-parametrically, with the smoothing parameters chosen from the data. Alternatively, one can employ any other smoothing technique that accounts for dependence in the errors (Opsomer et al. 2001; Krivobokova and Kauermann 2007; Lee et al. 2010). A very good choice is the well-established function gamm from the mgcv package. However, for this procedure some parameters need to be fixed in advance. We experimented with several settings and found that choosing the number of knots to be at least 40 and specifying the correlation structure as corAR1() delivers results very much comparable to those of eBsc.
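As an illustration of this alternative smoothing step (a sketch under the settings just stated, not the authors' implementation; the basis dimension k = 40 mirrors the knot recommendation above and requires more than 40 values of the criterion):

```r
# Sketch: smooth IHS(k) with mgcv::gamm using an AR(1) error structure.
library(mgcv)                  # gamm(); loads nlme, which provides corAR1()

smooth_ihs <- function(ihs) {
  dat <- data.frame(y = ihs, ks = seq_along(ihs))
  fit <- gamm(y ~ s(ks, k = 40), data = dat, correlation = corAR1())
  as.numeric(predict(fit$gam, newdata = dat))  # sIHS: estimate of E[IHS(k)]
}
```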

Fig. 1 On the left, IHS (dashed) and sIHS (red) are plotted for a Fréchet(2) sample of size n = 500. On the right, the Hill plot for the same sample is shown with the minimizing k of IHS (black) and sIHS (red); the dotted line marks the true value γ = 0.5

We finally want to remark on the relation between IHS and ISE, which is given by

$$ \text{IHS} +\frac{1}{2\gamma}= \text{ISE} +\frac{2}{k\hat{\gamma}_{k}} +\frac{\hat{\gamma}_{k}- \gamma}{\hat{\gamma}_{k}(\hat{\gamma}_{k}+\gamma)}. $$
(6)

This equation points out that minimizing IHS does not minimize ISE, as IHS takes an additional bias term into account, and that the sign of the bias of \(\hat {\gamma }_{k}\) plays a role. If the bias is negative, we suggest using

$$ \begin{array}{@{}rcl@{}} \text{IHS}^{-}(k) := \frac{4+k}{2\hat{\gamma}_{k} k}\quad \text{and} \quad \hat{k}_{\text{IHS}^{-}} := \arg\min_{1<k<n}\ \text{IHS}^{-}(k) \end{array} $$

instead of IHS, where the two cases can easily be distinguished by inspecting the Hill estimator for large k. If the bias of the Hill estimator is positive, which is the standard case, we can see from Eq. 6 that IHS selects smaller k (larger thresholds) than ISE. This is not surprising, because we estimate the expectation of the ISE under the hypothesis that the exponential approximation holds. This is a much more conservative error functional, meaning it is stricter against deviations from the exponential distribution.

In conclusion, with IHS we do not aim to estimate kopt but to find a sample fraction where we can be very certain that the exponential approximation still holds. The performance of sIHS in small samples is illustrated in simulations and an application in Sections 4 and 5.

2.1 Theoretical analysis of IHS

In order to understand the IHS asymptotically, we consider the second-order condition,

$$ \underset{t\rightarrow\infty}{\lim} \frac{\frac{U(tx)}{U(t)}-x^{\gamma}}{A(t)} =x^{\gamma}\frac{x^{\rho}-1}{\rho}, $$
(7)

for x > 0 and with second-order parameter ρ < 0. Here, A(t) denotes a function converging to zero as t tends to infinity, and |A| is regularly varying with index ρ. Further, U is defined by \(U(x):=F^{\leftarrow }\!\left (1-\frac {1}{x}\right )\), where \(F^{\leftarrow }\) denotes the left-continuous inverse of the distribution function F. In this setting the following asymptotic normality statements for the Hill estimator \(\hat {\gamma }_{k}\) hold.

Theorem 1 (Theorem 3.2.5 in de Haan and Ferreira, 2006)

Let \(X_{1},\dots ,X_{n}\) be i.i.d. random variables with distribution function F ∈ DoA(Gγ) for γ > 0. If Eq. 7 holds and k is an intermediate sequence, i.e. \(k\rightarrow \infty \) and \(k/n\rightarrow 0\) as \(n\rightarrow \infty \), then

$$ \begin{array}{@{}rcl@{}} \sqrt{k}(\hat{\gamma}_{k} - \gamma) \overset{\mathcal{D}}{\longrightarrow}\ \mathcal{N}\left( \frac{\lambda}{(1-\rho)}, \gamma^{2}\right), \end{array} $$

as \(n\rightarrow \infty \) and with \(\lambda := \underset {n\rightarrow \infty }{\lim } \sqrt {k}A(n/k)\).

Theorem 2

Under the conditions of Theorem 1, it holds as \(n\rightarrow \infty \) that

$$ \begin{array}{@{}rcl@{}} \sqrt{k} \left( \frac{1}{\hat{\gamma}_{k}}-\frac{1}{\gamma} \right) \overset{\mathcal{D}}{\longrightarrow}\ \mathcal{N}\left( \frac{-\lambda }{(1-\rho) \gamma^{2}}, \frac{1}{\gamma^{2}} \right). \end{array} $$

Proof

Apply the delta method to Theorem 1. □

Following the reasoning in de Haan and Ferreira (2006), page 78, the minimizing point of the AMSE of the Hill estimator can be found explicitly when considering \(A(t) = ct^{\rho}\) with c ≠ 0. In this special case, the minimizing sample fraction can be expressed as

$$ k_{\text{opt}}= \left( \frac{\gamma^{2} (1-\rho)^{2}}{-2\rho c^{2}} \right)^{1/(1-2\rho)} n^{-2\rho/(1-2\rho)}. $$
(8)
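For illustration (our own numerical example; the value c = 1 is an arbitrary assumption), Eq. 8 can be evaluated directly:

```r
# Direct evaluation of Eq. 8 for given gamma, rho and c.
k_opt <- function(n, gamma, rho, c) {
  (gamma^2 * (1 - rho)^2 / (-2 * rho * c^2))^(1 / (1 - 2 * rho)) *
    n^(-2 * rho / (1 - 2 * rho))
}
k_opt(n = 500, gamma = 0.5, rho = -1, c = 1)   # approximately 50
```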

Under the same assumption, we can calculate the minimizing point kIHS of the asymptotic expectations of IHS and \(\text{IHS}^{-}\). Let \(\mathbb {A}\mathbb {E}\) denote the asymptotic expectation referring to the expectation of the limiting distribution in Theorem 2. Then

$$ \begin{array}{@{}rcl@{}} k_{\text{IHS}} &:= \arg\min_{k}\ \mathbb{A}\mathbb{E}[\text{IHS}]=\arg\min_{k}\left\{ \frac{2}{\gamma k} +\frac{A(n/k)}{2\gamma^{2}(1-\rho) }\cdot \frac{k-4}{k} \right\} \\ &\approx \arg\min_{k}\left\{ \frac{2}{\gamma k} +\frac{A(n/k)}{2\gamma^{2}(1-\rho) } \right\} = \left( \frac{4\gamma(1-\rho)}{-\rho c}\right)^{1/(1-\rho)} n^{-\rho/(1-\rho)} . \end{array} $$

It is easy to check that the same formula holds for \(\text{IHS}^{-}\) if c is replaced by its absolute value. Comparing kopt and kIHS for a fixed \(\rho >-\infty \) we obtain that

$$ \begin{array}{@{}rcl@{}} \frac{k_{\text{IHS}}}{k_{\text{opt}}} \approx \left( \frac{-\rho}{32}\cdot k_{\text{IHS}}\right)^{-1/(1-2\rho)}\approx d\cdot n^{\frac{\rho}{(1-2\rho)(1-\rho)}} \longrightarrow 0, \end{array} $$
(9)

as \(n\rightarrow \infty \) and for a constant d depending on ρ, γ and c. This supports what Eq. 6 already suggested: minimizing IHS asymptotically yields a smaller k than kopt. Thus, kIHS is asymptotically suboptimal for the Hill estimator but is still an intermediate sequence and hence leads to consistent estimates. For finite samples the ratio crucially depends on ρ, and kIHS can also be larger than kopt, as illustrated in Fig. 2. It also holds that \(k_{\text {IHS}}/k_{\text {opt}}\rightarrow 1\) as \(\rho \rightarrow -\infty \), since both sample fractions converge to n in this case.
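For the reader's convenience, we spell out the short calculation behind the first approximation in Eq. 9. Since \(k_{\text {IHS}}^{1-\rho } = 4\gamma (1-\rho )n^{-\rho }/(-\rho c)\), comparing with Eq. 8 gives

$$ \begin{array}{@{}rcl@{}} k_{\text{opt}}^{1-2\rho} = \frac{\gamma^{2}(1-\rho)^{2}}{-2\rho c^{2}}\, n^{-2\rho} = \frac{1}{-2\rho}\left( \frac{-\rho}{4}\, k_{\text{IHS}}^{1-\rho}\right)^{2} = \frac{-\rho}{32}\, k_{\text{IHS}}^{2-2\rho}, \end{array} $$

and solving for kopt and dividing yields \(k_{\text {IHS}}/k_{\text {opt}} = ((-\rho /32) k_{\text {IHS}})^{-1/(1-2\rho )}\).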

Fig. 2 The approximation of the quotient kIHS/kopt in Eq. 9 is plotted as a function of the second-order parameter ρ with γ = c = 1 for n = 500 (solid), n = 5000 (dashed) and n = 50000 (dotted)

Although kIHS is of smaller order than kopt asymptotically, the simulation study in Section 4 shows that \(\hat {k}_{\text {IHS}}\) works quite well in small samples and in particular when used for quantile estimation of certain processes. We consider the following quantile estimator for the (1 − p)-quantile,

$$ \hat{q}_{k}(p) = X_{(n-k,n)} \left( \frac{k}{np} \right)^{\hat{\gamma}_{k}}, $$
(10)

as defined in Theorem 4.3.8 in de Haan and Ferreira (2006). This theorem shows that the sample fraction kopt also minimizes the asymptotic relative MSE of \(\hat {q}_{k}(p)\). For finite samples, however, the quantile estimator seems to benefit from kIHS, since in this case IHS often works better than procedures estimating kopt, see Section 4. This can have different reasons, two of which are illustrated by Fig. 3. Figure 3 displays the empirical expectation of IHS and of SAMSEE defined in Section 3, the empirical versions of the MSE of \(\hat {\gamma }_{k}\) and the relative MSE of the quantile estimator,

$$ \text{MSEQ}:= \mathbb{E}\left[ \left( \frac{ \hat{q}_{k}(p)}{q(p)}-1 \right)^{2} \right]/ \log\left( \frac{k}{np}\right), $$
(11)

as used in Theorem 4.3.8 in de Haan and Ferreira (2006). The estimator \(\hat {q}_{k}(p)\) is more sensitive to changes in k than \(\hat {\gamma }_{k}\), and thus the empirical version of MSEQ seems to require many more samples than the MSE of the Hill estimator to reach a stable estimate.
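A minimal R sketch (ours) of the quantile estimator in Eq. 10:

```r
# Weissman-type quantile estimator of Eq. 10 for a sample x, a selected
# sample fraction k and a tail probability p.
quantile_hat <- function(x, k, p) {
  n  <- length(x)
  xs <- sort(x)                                         # X_(1,n) <= ... <= X_(n,n)
  gk <- mean(log(xs[(n - k + 1):n])) - log(xs[n - k])   # Hill estimate (Eq. 3)
  xs[n - k] * (k / (n * p))^gk
}
```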

Fig. 3 Empirical expectations of IHS plus a constant (red), MSE (black), MSEQ (pink) and SAMSEE (blue, see Section 3) are presented. The left plot is based on 10,000 samples of size 500 from a Fréchet(2) distribution. The right plot is based on 500 samples of size 5000 from a Loggamma distribution

On the left of Fig. 3, we observe that kIHS is indeed smaller than kopt for a Fréchet distribution, where ρ = − 1, but so is the minimizer of MSEQ. The graphic on the right highlights the similarities between MSE, MSEQ and IHS for the boundary case ρ = 0 of the Loggamma distribution. The method SAMSEE, which is defined in the next section, overestimates kopt on average in the Loggamma case and its mean lies between kIHS and kopt for the Fréchet distribution.

3 Smooth AMSE estimator

In this section, we illustrate a way to estimate kopt via minimizing a smooth estimator for the AMSE of the Hill estimator, called SAMSEE. From Theorem 1 it is easy to see that the AMSE of the Hill estimator is

$$ \mathbb{A}\mathbb{E}[(\hat{\gamma}_{k}-\gamma)^{2}]= \frac{\gamma^{2}}{k} + \frac{A(n/k)^{2}}{(1-\rho)^{2}}. $$
(12)

To estimate the AMSE as a function of k, we employ two estimators, one for γ and one for the bias term as a combination of ρ and A. First we explain how we estimate γ and then we define the bias estimator. For γ we consider the generalized Jackknife estimator \(\hat {\gamma }_{k}^{\text {GJ}}\) introduced by Gomes et al. (2000) as \(\gamma _{n}^{\mathrm {G}_{1}}\). This estimator is defined by

$$ M_{n,k}:= \frac{1}{k}\sum\limits_{i=1}^{k} Y_{(i,k)}^{2},\quad \hat{\gamma}^{\mathrm{V}}_{k}:=\frac{M_{n,k}}{2\hat{\gamma}_{k}} , \text{ and} \quad \hat{\gamma}_{k}^{\text{GJ}}:= 2\hat{\gamma}^{\mathrm{V}}_{k} - \hat{\gamma}_{k}, $$
(13)

where Y(i,k) denotes the log-spacings as in Eq. 3. Note that \(\hat {\gamma }^{\mathrm {V}}_{k}\) is the de Vries estimator introduced under this name in de Haan and Peng (1998) and \(\hat {\gamma }_{k}\) is the Hill estimator as above. The generalized Jackknife estimator has a reduced bias compared to the Hill estimator and is even asymptotically unbiased if ρ = − 1, see (2.11) in Gomes et al. (2000).
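The following R sketch (ours) evaluates the three estimators in Eq. 13 along the whole path k = 1,…,n − 1, expanding \(M_{n,k}\) so that all k can be computed with cumulative sums:

```r
# De Vries and generalized Jackknife estimators of Eq. 13 for all k.
gj_all <- function(x) {
  ls <- sort(log(x), decreasing = TRUE)        # descending log order statistics
  k  <- seq_len(length(ls) - 1)
  s1 <- cumsum(ls)[k] / k                      # mean of the top-k log values
  s2 <- cumsum(ls^2)[k] / k
  gh <- s1 - ls[k + 1]                         # Hill estimator (Eq. 3)
  M  <- s2 - 2 * ls[k + 1] * s1 + ls[k + 1]^2  # M_{n,k} = mean of Y_(i,k)^2
  gv <- M / (2 * gh)                           # de Vries estimator
  list(hill = gh, devries = gv, gj = 2 * gv - gh)  # generalized Jackknife
}
```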

Danielsson et al. (2001) use \((\hat {\gamma }^{\mathrm {V}}_{k}-\hat {\gamma }_{k})\) to assess the bias of \(\hat {\gamma }_{k}\) and apply a double bootstrap procedure to stabilize this highly varying estimate. We also use the difference of two estimators for γ, but now consider the following averages of Hill estimators,

$$ \bar{\gamma}_{k,K} := \frac{1}{K-k+1}\sum\limits_{i=k}^{K} \hat{\gamma}_{i} \quad \text{and} \quad \bar{\gamma}_{k} := \bar{\gamma}_{1,k}= \frac{1}{k}\sum\limits_{i=1}^{k} \hat{\gamma}_{i}, $$

where k < K is chosen to obtain a stable bias estimate. The idea to average the Hill estimator in order to reduce variation in the sample path is also studied in Resnick and Stǎricǎ (1997). These averages naturally contain a lot of structural information about the underlying asymptotic bias and inspire the definition

$$ \hat{b}_{k,K}:= \bar{\gamma}_{k,K}-\bar{\gamma}_{K}. $$
(14)

Theorem 3

Under the conditions of Theorem 1 and for \(k/K\rightarrow c\) with 0 < c < 1 as \(n\rightarrow \infty \), it holds for \(\hat {b}_{k,K}\) defined in Eq. 14 that

$$ \begin{array}{@{}rcl@{}} \sqrt{k}\cdot\hat{b}_{k,K} \overset{\mathcal{D}}{\longrightarrow} \mathcal{N}\left( \frac{-\rho\lambda}{(1-\rho)^{2}}\delta_{\rho}(c),\ \gamma^{2} \nu(c) \right), \end{array} $$

where \(\delta_{\rho }(c) = (c^{\rho } - 1)/(-\rho (c^{-1} - 1))\) and \(\nu (c)=2c^{2}/(1-c)^{2}\cdot (1-c+c\log (c))\) with 0 ≤ ν(c) ≤ 1.

Proof

The proof is given in Appendix A. □

By Theorem 3 the estimator \(\hat {b}_{k,K}\) has the following property if ρ = − 1, because in this case the function \(\delta_{-1}(c)\) is equal to 1,

$$ \mathbb{A}\mathbb{E}\left[\hat{b}_{k,K}\right]=\frac{-\rho A(n/k)}{(1-\rho)^{2}} = \frac{1}{2} \frac{A(n/k)}{(1-\rho)} =\frac{1}{2}\mathbb{A}\mathbb{E}\left[\hat{\gamma}_{k}-\gamma\right]. $$
(15)

It remains to choose an appropriate K that is large enough to allow for minimization over all relevant k and small enough to be an intermediate sequence itself, see Theorem 3. To find such a K we use the following relation between the estimators that holds if ρ = − 1,

$$ \begin{array}{@{}rcl@{}} \mathbb{A}\mathbb{E}[\hat{\gamma}_{k}] = \mathbb{A}\mathbb{E}[\hat{\gamma}^{\mathrm{V}}_{k}+ \hat{b}_{k,K}]. \end{array} $$
(16)

This provides us with a relatively stable function in k that has the same asymptotic expectation as the highly non-smooth Hill estimator. We want to find an intermediate sequence K for which Eq. 16 holds and thus define

$$ \text{AD}(K):=\frac{1}{K}\sum\limits_{k=1}^{K} \left( \hat{\gamma}^{\mathrm{V}}_{k}+ \hat{b}_{k,K} - \hat{\gamma}_{k} \right)^{2} $$
(17)

to measure the deviation from approximation (16) uniformly over all kK. We want to find the minimum of AD(K), but not the one that occurs due to high variation for small k. We look for a K such that the stabilized numerical approximation of the derivative of AD is closest to zero,

$$ \begin{array}{@{}rcl@{}} K^{*} :=\underset{K}{\arg\min}\bigg\{ \sum\limits_{i=-2, i\neq 0}^{2} \bigg|\frac{\text{AD}(K)- \text{AD}(K+i)}{i} \bigg| \bigg\}. \end{array} $$
(18)

Now we combine the previously described estimators to approach the AMSE in Eq. 12 under the assumption that ρ = − 1. The idea of misspecifying ρ to simplify estimation – by avoiding the additional uncertainty of estimating ρ or selecting an influential tuning parameter – was already used, for example, by Gomes et al. (2000), Drees and Kaufmann (1998) and Goegebeur et al. (2008). It is also motivated by the simulations in Section 3.1.

Considering \(K^{*}\) from Eq. 18 and the property of \(\hat {b}_{k,K}\) in Eq. 15, we define estimators for the AMSE of the Hill estimator and for kopt by

$$ \begin{array}{@{}rcl@{}} \text{SAMSEE}(k) &:=& \frac{(\hat{\gamma}_{K^{*}}^{\text{GJ}})^{2}}{k} + 4\hat{b}_{k,K^{*}}^{2}, \\ \hat{k}_{\text{SAMSEE}}&:=& \underset{1<k< K^{*}}{\text{argmin}}\ \text{SAMSEE}(k). \end{array} $$
(19)
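Putting Eqs. 14 and 17–19 together, a compact R sketch of the procedure could look as follows (our reading of the definitions, reusing gj_all() from the sketch above; it is not the authors' code):

```r
# SAMSEE threshold selection: bias estimate (Eq. 14), AD (Eq. 17),
# K* (Eq. 18) and the minimiser of SAMSEE (Eq. 19).
samsee_select <- function(x) {
  est <- gj_all(x)
  gh  <- est$hill
  gv  <- est$devries
  m   <- length(gh)
  gbar  <- cumsum(gh) / seq_len(m)                 # running mean of the Hill path
  b_hat <- function(k, K) mean(gh[k:K]) - gbar[K]  # Eq. 14
  AD  <- function(K) mean((gv[1:K] + sapply(1:K, b_hat, K = K) - gh[1:K])^2)
  ADv <- sapply(seq_len(m), AD)
  Ks  <- 3:(m - 2)                                 # keep K +/- 2 inside the range
  crit <- sapply(Ks, function(K)
    sum(abs((ADv[K] - ADv[K + c(-2, -1, 1, 2)]) / c(-2, -1, 1, 2))))
  K_star <- Ks[which.min(crit)]                    # Eq. 18
  k   <- 2:(K_star - 1)
  sam <- est$gj[K_star]^2 / k + 4 * sapply(k, b_hat, K = K_star)^2   # Eq. 19
  list(K_star = K_star, k_hat = k[which.min(sam)], samsee = sam)
}
```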

The left panel of Fig. 4 displays SAMSEE for a Fréchet sample with parameters γ = 1/2 and ρ = − 1. The Hill plot of the same sample for all \(k \leq K^{*} = 388\) is shown in the right panel of Fig. 4.

Fig. 4 SAMSEE with \(K^{*} = 388\) on the left, next to the Hill plot for the same Fréchet(2) random sample of size n = 500 for \(k \leq K^{*}\). The red dot indicates the selected sample fraction \(\hat {k}_{\textup {SAMSEE}}\) in the left plot and the adaptive Hill estimate \(\hat {\gamma }_{\hat {k}_{\textup {SAMSEE}}}\) in the right plot

This smooth estimate of the AMSE can be useful beyond the context of threshold selection. SAMSEE evaluates how well each k performs as a threshold in the peaks-over-threshold model, where the lowest values indicate a good fit. Thus, SAMSEE could be used to construct a transition function between bulk and tail distribution in extreme value mixture models or an empirical prior for the threshold in Bayesian threshold selection approaches, see Scarrott and MacDonald (2012) for a review on mixture models.

3.1 SAMSEE if ρ ≠ −1

In the broader context of an unknown second-order parameter ρ, we note that the generalized Jackknife estimator is no longer asymptotically unbiased and that the behaviour of our bias estimator changes. From Theorem 3 it follows that

$$ \mathbb{A}\mathbb{E}[\hat{b}_{k,K}] = \frac{-\rho A(n/k)}{(1-\rho)^{2}} \cdot \delta_{\rho}(k/K). $$

If ρ ≠ −1, we observe that the function δρ(k/K) influences the rate at which the bias estimator increases with k. We can still apply SAMSEE in this situation and select \(K^{*}\) from Eq. 18. However, approximation (16) does not hold anymore, and instead it holds that

$$ \begin{array}{@{}rcl@{}} \mathbb{A}\mathbb{E}[\hat{\gamma}_{k} - \hat{\gamma}^{\mathrm{V}}_{k}]- \mathbb{A}\mathbb{E}[\hat{b}_{k,K}] = \frac{-\rho A(n/k)}{(1-\rho)^{2}}\left( 1-\delta_{\rho}(k/K) \right) . \end{array} $$
(20)

The absolute value of the error described by Eq. 20 is large if δρ strongly differs from 1 and the bias term A is large. If ρ ≠ −1, δρ indeed deviates from 1, and the difference in Eq. 20 is minimized by minimizing A. This is why applying Eq. 18 leads to a small \(K^{*}\) in this case. On the other hand, if δρ = 1, the approximation stays valid for an increasing bias and \(K^{*}\) will typically be larger.

An alternative to fixing ρ = − 1 is to incorporate a consistent estimator \(\hat {\rho }\) of the second-order parameter. Estimators for ρ can be found in Caeiro and Gomes (2014), Gomes et al. (2002), Fraga Alves et al. (2003) or Drees and Kaufmann (1998). Such an estimator can be included into the SAMSEE procedure via

$$ \begin{array}{@{}rcl@{}} &&\qquad K_{\hat{\rho}}^{*} :=\underset{K}{\arg\min}\bigg\{ \sum\limits_{i=-2, i\neq 0}^{2} \bigg|\frac{\text{AD}_{\hat{\rho}}(K)- \text{AD}_{\hat{\rho}}(K+i)}{i} \bigg| \bigg\},\\ &&\text{where }\text{AD}_{\hat{\rho}}(K) :=\frac{1}{K}\sum\limits_{k=1}^{K} \left( \hat{\gamma}^{\mathrm{V}}_{k}+ \hat{b}_{k,K}/\delta_{\hat{\rho}}(k/K) - \hat{\gamma}_{k} \right)^{2}, \end{array} $$

and

$$ \begin{array}{@{}rcl@{}} \text{SAMSEE}_{\hat{\rho}}(k)&:=& \frac{(\hat{\gamma}_{K_{\hat{\rho}}^{*}}^{\text{GJ}})^{2}}{k} + \bigg((1-\hat{\rho}) \frac{ K_{\hat{\rho}}^{*}/k-1}{(k/ K_{\hat{\rho}}^{*})^{\hat{\rho}}-1}\cdot \hat{b}_{k, K_{\hat{\rho}}^{*}}\bigg)^{2}, \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} \hat{k}_{\hat{\rho},\text{SAMSEE}}&:=& \underset{1<k< K_{\hat{\rho}}^{*}}{\text{argmin}}\ \text{SAMSEE}_{\hat{\rho}}(k). \end{array} $$
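In code, only the bias term changes relative to the sketch in Section 3 (a hypothetical helper of ours; b stands for \(\hat{b}_{k,K}\) and rho_hat for an externally supplied estimate):

```r
# Squared bias term of Eq. 21; for rho_hat = -1 the correction factor
# (1 - rho)(K/k - 1)/((k/K)^rho - 1) equals 2, recovering 4 * b^2 in Eq. 19.
bias_term_rho <- function(k, K, b, rho_hat) {
  ((1 - rho_hat) * (K / k - 1) / ((k / K)^rho_hat - 1) * b)^2
}
```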

In Table 1, we present the results of a simulation study indicating for which distributions it is beneficial to use \(\hat {\rho }\) instead of ρ = − 1. We estimate ρ using the estimator \(\hat {\rho }^{(1)}\) suggested in Theorem 1 in Drees and Kaufmann (1998). The results indicate that for these processes it is sensible to fix ρ = − 1 in SAMSEE, since only for the Cauchy distribution does using \(\hat {\rho }\) perform slightly better regarding both bias and root mean squared error (RMSE). This confirms the observation already made by others (Gomes et al. 2000; Drees and Kaufmann 1998; Goegebeur et al. 2008) that it is often advisable to fix ρ = − 1 instead of allowing for further variability by including an additional estimator.

Table 1 The averages of adaptive γ estimates and their root mean squared error (in brackets) are presented for thresholds \(\hat {k}\) selected using SAMSEE or \(\textup {SAMSEE}_{\hat {\rho }}\) with the true ρ, ρ = − 1 or \(\hat {\rho }=\hat {\rho }^{(1)}\)

4 Simulation study

In the following, we numerically analyse the performance of ten threshold selection methods on heavy-tailed distributions with very different tail behaviours. The simulation study is based on the following distributions:

  • the Student-t distribution with 6 degrees of freedom, which corresponds to γ = 1/6 and ρ = − 1/3,

  • the Fréchet distribution with parameter α = 2 and distribution function \(F(x)=\exp (-x^{-\alpha })\) for x > 0, which implies γ = 1/2 and ρ = − 1,

  • the standard Cauchy distribution leading to a tail behaviour with γ = 1 and ρ = − 2,

  • the Loggamma distribution with γ = 1 and ρ = 0,

  • the Burr distribution with a parametrisation such that γ = 2, ρ = − 1 and distribution function

    $$ F(x)=1-(1+\sqrt{x})^{-1},\ \text{for } x>0, $$
  • a logarithmically perturbed Pareto distribution, namely the distribution of g(U) with γ = 1 and ρ = − 1, where \(U\sim \text {Unif}(0,1)\) and \(g(x)=x^{-1}/\log (x^{-1})\). This distribution is denoted as negative Bias owing to the negative bias of the Hill estimator when applied to random samples from this distribution (Drees et al. 2000).

On these distributions, we evaluate the methods by their root median square error (RMedSE) when adaptively estimating γ with the Hill estimator relative to the RMedSE obtained using kopt,

$$ \text{EFF}_{\gamma}(\hat{k}):=\sqrt{\frac{\mathbb{M}[(\hat{\gamma}_{\hat{k}}-\gamma)^{2}]}{\mathbb{M}[(\hat{\gamma}_{k_{\text{opt}}}-\gamma)^{2}]}}, $$

where \(\mathbb {M}\) denotes the median. These efficiency quotients, with the mean instead of the median, are also used by, e.g., Guillou and Hall (2001), Gomes et al. (2000) and Drees and Kaufmann (1998). We opted in favour of the median to reduce the influence of a few extreme outliers. The efficiency values based on the mean are additionally provided in Appendix A.

The smaller the quotient, the better the threshold selection procedure performs compared to the optimal sample fraction kopt. Values below 1 indicate that the selection method, which chooses a specific k for each sample, performs better than kopt, which is fixed for all samples of a distribution. Furthermore, we study the efficiency in quantile estimation with the estimator defined in Eq. 10 for p = 0.001,

$$ \text{EFF}_{q}(\hat{k}):=\sqrt{ \frac{\mathbb{M}[ (\hat{q}_{\hat{k}}- q)^{2} ]}{\mathbb{M}[ (\hat{q}_{k_{\text{opt}}}- q)^{2}]}}. $$

Since the true minimizer kopt of the AMSE is not known, we utilize an empirical version suggested by Gomes et al. (2000). Following their approach we approximate kopt by the mean of 20 independent replicates of \(\bar {k}_{\text {opt}}\), which is the minimizer of the empirical MSE based on 1000 samples, i.e. \(\bar {k}_{\text {opt}}= \text {argmin}_{k} \ \mathbb {E}_{n=1000} [ (\hat {\gamma }_{k}-\gamma )^{2} ]. \)
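In R, this empirical approximation can be sketched as follows (our illustration for a Fréchet(2) sample, reusing hill_all() from the sketch in Section 1):

```r
# Empirical approximation of k_opt: minimise the empirical MSE of the Hill
# path over 1000 samples, then average 20 independent replicates.
emp_k_opt <- function(n = 1000, reps = 1000, gamma = 0.5) {
  se <- matrix(0, reps, n - 1)
  for (r in seq_len(reps)) {
    x <- (-log(runif(n)))^(-gamma)     # Fréchet(1/gamma) sample
    se[r, ] <- (hill_all(x) - gamma)^2
  }
  which.min(colMeans(se))              # minimiser of the empirical MSE
}
k_opt_approx <- mean(replicate(20, emp_k_opt()))
```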

We compare these efficiency values for ten different threshold selection methods. If the choice of tuning parameters is required, we follow the recommendations given by the corresponding authors based on their numerical studies. Note that the choice of these parameters is not data-driven and may not be optimal. Most of the considered approaches are constructed for adaptive estimation of γ applying the Hill estimator. This includes one procedure that looks for a stable region among the Hill estimates, while the others aim to estimate kopt. We also evaluate the performance of the IHS approach discussed in Section 2, which aims to minimize the deviation from the exponential approximation and two methods relying on goodness-of-fit statistics. In total, the following methods are considered:

SAM: SAMSEE procedure with ρ = − 1 as defined in Eq. 19 in Section 3,

GH: Method by Guillou and Hall (2001) utilizing \(c_{\text{crit}}\) = 1.25 and p = 1,

DK: Procedure by Drees and Kaufmann (1998) with fixed ρ = − 1,

GO: Approach by Goegebeur et al. (2008) defined in their equation (3.3) with fixed ρ = − 1,

DB: Double bootstrap approach by Danielsson et al. (2001) with \(n_{1}\) minimizing a criterion suggested in their article, where \(n_{1} = n^{1 - 1/b}\) with b ∈{4,7,12,24,38},

B: Method by Beirlant et al. (2002) with ρ = − 1 and taking the median of the estimates for \(k_{0}\in \{3,\dots ,n/2\}\),

RT: Method by Reiss and Thomas (2007) with β = 0 as suggested by Neves and Fraga Alves (2004),

C: Method defined in equation (3.9) of Clauset et al. (2009),

NC: Testing approach by Northrop and Coleman (2014) described in their Section 4.2, using α = 0.2 and 15 threshold suggestions from the 0.5 to the 0.95 empirical quantile,

sIHS: IHS smoothed by using the eBsc package, see Section 2.

Figures 5 and 6 illustrate the efficiency values obtained in our simulation study from random samples of size n = 150, 500 and 5000 drawn from the distributions above. The figures provide a convenient overview of the spread of the efficiency values for each method over the range of considered distributions. All values, including those that are too large for the detailed graphics, can be found in Tables 2 and 3 in Appendix A.

Fig. 5 Efficiency values EFFγ based on 2000 samples for n = 150 and n = 500, and on 500 samples for n = 5000

Fig. 6 Efficiency values EFFq for p = 0.001 based on 2000 samples for n = 150 and n = 500, and on 500 samples for n = 5000

Table 2 Efficiency values EFFγ based on 2000 samples if n = 500 and on 500 samples if n = 5000
Table 3 Efficiency values EFFq for p = 0.001 based on 2000 samples if n = 500 and on 500 samples if n = 5000

Overall, we see a very diverse picture of which methods perform best, depending on the distribution and the sample size. The graphs also show that SAMSEE is nicely concentrated close to one for all considered scenarios and especially for quantile estimation. The good performance of SAMSEE persists for even more extreme quantiles like p = 0.0001, see Table 4. When looking at adaptive estimation of γ in Fig. 5, DK performs comparably well to SAMSEE on small samples, and for larger samples DB and B show efficiency values similarly concentrated as SAMSEE. Considering quantile estimation in Fig. 6, sIHS is an alternative to SAMSEE on small samples, with even lower efficiency values for four of the six distributions. The approach B is almost as closely concentrated around 1 as SAMSEE when estimating quantiles from larger samples. The performance of sIHS depends on the underlying distribution and deteriorates with growing sample size. The estimation of γ using sIHS still improves for larger samples, but it performs worse relative to kopt, which confirms the theoretical findings in Section 2.1. However, sIHS performs impressively well, for example, for the Loggamma distribution, with efficiency values below 1 in every scenario. In our study, sIHS performs well for smaller samples and for quantile estimation. SAMSEE, on the other hand, shows impressively stable performance over the whole range of distributions, sample sizes and tasks. This makes SAMSEE a very good choice for threshold selection if no expert knowledge about the underlying distribution or about parameter tuning is available.

Table 4 Efficiency values EFFq for p = 0.0001 based on 2000 samples and n = 500

5 Application to varying extreme value index

In this section, we use the two introduced estimators for the sample fraction in a financial application on operational losses, where we are particularly interested in the distributional properties of very high losses. The observations of interest are operational losses from the Italian bank UniCredit collected from January 2005 to June 2014. The dataset provides scaled losses above 2000 Euro recorded with the corresponding date. The operational losses are grouped by the type of event that caused the specific loss. We consider the event type CPBP, i.e. losses caused by clients, products and business practices related to derivatives or other financial instruments. This group provides 16,138 observations over the period from 2005 to 2014, which seems sufficient for our local estimation approach. The number of collected losses per quarter is presented on the left of Fig. 7, next to the logarithm of the raw data over time.

Fig. 7 The quarterly number of observed CPBP losses is illustrated on the left; on the right, the logarithm of the losses is plotted over time with a red line marking the third quartile

It has been discussed before that it is reasonable to assume that the distribution of such extreme losses is heavy-tailed (Chavez-Demoulin et al. 2016; Moscadelli 2004) and that it changes with the financial market over time (Hambuckers et al. 2018; Cope et al. 2012). In Fig. 12 in Appendix A, we provide some graphical justification that this assumption holds locally for the dataset of CPBP losses. In Hambuckers et al. (2018) the data is analysed in a regularized generalized Pareto regression approach including several firm-specific, macroeconomic and financial indicators as covariates. This approach describes the dependence of the GPD parameters on various covariates via parametric functions. We consider the same data, but with a different focus. Our aim is to estimate the time-dependent extreme value index γ(t) non-parametrically with a simple ad hoc estimator that extends the estimator from de Haan and Zhou (2020) by utilizing the approaches sIHS and SAMSEE presented in Sections 2 and 3 for the local selection of a threshold. We present the estimator for γ(t) in Section 5.1, and the results of applying it to the dataset of operational losses are shown in Section 5.2.

5.1 Estimating a varying extreme value index

In de Haan and Zhou (2020), the authors discuss testing for a trend in the extreme value index and estimating it non-parametrically. They consider n independent random variables \(X_{i}\sim F_{i/n}\), where \(F_{s} \in \text{DoA}(G_{\gamma (s)})\) for s ∈ [0,1], and they introduce the following estimator for γ(s), which locally applies the Hill estimator and is based on a global sample fraction k,

$$ \hat{\gamma}_{k}(s):= \frac{1}{2kh} \sum\limits_{i\in I_{n}(s)} \left( \log X_{i} - \log X_{(\lfloor 2nh\rfloor-\lfloor 2kh\rfloor, \lfloor 2nh\rfloor )} \right)^{+}, $$
(22)

where \(I_{n}(s)\) is the h-neighbourhood of s, i.e. \(I_{n}(s) := \{i : |i/n - s| \leq h\}\). This estimator depends on the choice of the bandwidth h and the global sample fraction k, which is then rescaled to 2kh for the individual regions \(I_{n}(s)\). A small bandwidth h leads to very high variability in \(\hat {\gamma }_{k}(s)\), while a large value of h smooths out all interesting features. Thus, the choice of h should balance these two effects. We suggest a modification of their estimator, where we locally estimate a threshold \(\hat {k}(s)\), i.e.

$$ \hat{\gamma}_{\hat{k}(s)}(s):= \frac{1}{\hat{k}(s)} \sum\limits_{i\in I_{n}(s)} \left( \log X_{i} - \log X_{(\lfloor 2nh\rfloor-\hat{k}(s), \lfloor 2nh\rfloor )} \right)^{+}. $$
(23)

To compare these two approaches, we repeat the simulation presented in de Haan and Zhou (2020), see their Figure 2(i), on samples of size n = 5000 with \(X_{i}\sim \text {Fr\'{e}chet}(1/\gamma (i/n))\), \(\gamma (s)=1+c\cdot \sin \limits (2\pi s)\) and c = 1/4. For both versions of the estimator the bandwidth is fixed to h = 0.025. Figure 8 illustrates the benefits of locally optimizing the threshold via SAMSEE from Section 3, as it strongly tightens the empirical confidence interval around the average, which is obtained from the 2.5% and 97.5% empirical quantiles among 2000 estimates.
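A sketch of the localised estimator in Eq. 23 (our illustration; select_k stands for any data-driven selector returning a sample fraction, e.g. samsee_select() from the sketch in Section 3 or the sIHS pipeline from Section 2):

```r
# Local Hill estimate at time s with bandwidth h and a locally selected k;
# select_k must return a value smaller than the local sample size.
gamma_local <- function(x, s, h, select_k) {
  n  <- length(x)
  xs <- sort(x[abs(seq_len(n) / n - s) <= h])      # sample in I_n(s), ordered
  m  <- length(xs)
  kk <- select_k(xs)                               # local threshold k_hat(s)
  mean(log(xs[(m - kk + 1):m])) - log(xs[m - kk])  # Hill estimator on I_n(s)
}
```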

Fig. 8 The true extreme value index \(\gamma (s)=1+1/4\cdot \sin \limits (2\pi s)\) (red) next to the averaged estimators over 2000 samples. The estimator from de Haan and Zhou (2020) with fixed k = 200 is in black, and its empirical 95% confidence interval is dashed. The localised estimator employing SAMSEE is blue with dotted empirical confidence bounds

5.2 Functional extreme value index of operational losses

First, we test whether the extreme value index is constant over time. With the test T4 by Einmahl et al. (2016) we can reject the null hypothesis of constant γ for various k in the range of thresholds selected by our procedures. One concern in testing is possible serial dependence in the financial losses, as discussed in Section 4.2 of de Haan and Zhou (2020), where it is suggested to repeat the test on subsets of the data to reduce the serial dependence. The test T4 still shows significant results if only every second or every fifth observation is considered. Thus, we are confident that the extreme value index of the losses is indeed varying over time.

To estimate the time-dependent γ(t) we apply the estimator in Eq. 23. We select \(\hat {k}\) using sIHS from Section 2 and the SAMSEE method from Section 3. It remains to choose the bandwidth h that defines the neighbourhood \(I_{n}(s)\). To our knowledge, no data-driven automated method for choosing an optimal bandwidth in the context of the peaks-over-threshold approach is available, and its investigation lies outside the scope of this article. Instead, we use a bandwidth of 170 days, thereby focusing on the long-term trend of γ, since this bandwidth leads to a less varying extreme value index by smoothing over seasonal variations. Figure 9 shows the estimates of γ(t) that we obtain for the event type CPBP. It is clearly visible that both procedures yield similar estimates for most time points and that the simple ad hoc estimators recover an increase in the severity of high losses during the financial and Euro crisis from 2008 to 2011. A similar overall trend in the extreme value index can also be identified in the estimates of Hambuckers et al. (2018) for CPBP.

Fig. 9 The non-parametric estimate of the extreme value index of the operational losses of type CPBP, using \(\hat {k}\) from SAMSEE (solid) and sIHS (dashed) and bandwidth h = 170 days. The red area indicates the time of the financial and Euro crisis

For a more extensive discussion of the data and results of the more complex model including further covariates we refer to Hambuckers et al. (2018).