Outlier detection based on extreme value theory and applications

Whether an extreme observation is an outlier or not depends strongly on the tail behavior of the underlying distribution. We develop an automatic, data-driven method rooted in the mathematical theory of extremes to identify observations that deviate from the intermediate and central characteristics of the data. The proposed algorithm extends a method previously proposed in the literature for the specific case of heavy-tailed Pareto-type distributions to all max-domains of attraction. We propose some applications, such as a tail-adjusted boxplot, which yields a more accurate representation of possible outliers, and the identification of outliers in a multivariate context through an analysis of associated random variables such as local outlier factors. Several examples and simulation results illustrate the finite-sample behavior of the algorithm and its applications.


INTRODUCTION
and cyber crime. It is then important to judge whether an observation, or a set of observations, can be explained by the model fitted to the bulk of the data or by a deviating model. A classical and popular tool for flagging outliers is the boxplot introduced by Tukey (1977), which marks an individual data point as a possible outlier when its distance from the corresponding quartile Q1 or Q3 exceeds 1.5 times the interquartile range (IQR). In fact, when the underlying distribution of the data is normal, X ∼ N(μ, σ²), the probability for an extreme normal observation to be wrongly indicated as an outlier in a classical boxplot (CL) is very small: at the right-hand tail of the distribution it is approximately given by P(X > Q3 + 1.5 IQR) = P(X > [μ + 0.67σ] + 1.5[1.34σ]) = P(Z > 2.68) = 0.0037, where Z ∼ N(0, 1). However, for distributions with a tail heavier than the normal model, the probability for a data point to be flagged increases with the heaviness of the tail. The probability P(X > Q3 + 1.5 IQR) equals 0.05 for an exponential distribution, while for a lognormal distribution, that is, log X ∼ N(μ, σ²), this probability rises to 0.08 when σ = 1 and 0.16 when σ = 2. For the Pareto distribution with distribution function F(x) = 1 − x^{−α} (x > 1, α > 0), this probability amounts to 0.073 when α = 4 and 0.16 when α = 1. In such cases the boxplot grossly overestimates the number of real outliers. At the other side of the spectrum, for tails with a finite right endpoint, the probability of indicating an observation as a potential outlier is typically lower, for instance 0 for the uniform distribution on (0, 1), since real outliers can then well be situated below Q3 + 1.5 IQR.
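These fence-exceedance probabilities are easy to reproduce numerically. The following sketch (Python standard library only; the helper `flag_prob` is ours, not from the paper) computes P(X > Q3 + 1.5 IQR) for some of the models above:

```python
from statistics import NormalDist
import math

def flag_prob(quantile, survival):
    """P(X > Q3 + 1.5*IQR): the chance that a fresh observation from the
    same distribution lands above Tukey's upper fence."""
    q1, q3 = quantile(0.25), quantile(0.75)
    return survival(q3 + 1.5 * (q3 - q1))

nd = NormalDist()
# standard normal
p_normal = flag_prob(nd.inv_cdf, lambda x: 1 - nd.cdf(x))
# unit exponential: quantile -log(1-p), survival exp(-x)
p_exponential = flag_prob(lambda p: -math.log1p(-p), lambda x: math.exp(-x))
# lognormal with sigma = 1: log X ~ N(0, 1)
p_lognormal = flag_prob(lambda p: math.exp(nd.inv_cdf(p)),
                        lambda x: 1 - nd.cdf(math.log(x)))
# Pareto with alpha = 4: F(x) = 1 - x**(-4), x > 1
p_pareto4 = flag_prob(lambda p: (1 - p) ** (-0.25), lambda x: x ** (-4.0))
```

Running it gives approximately 0.0035, 0.048, 0.078, and 0.073 for the normal, exponential, lognormal (σ = 1) and Pareto (α = 4) cases, in line with the rounded values quoted above.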
In Hubert and Vandervieren (2008) the CL was adjusted for skewness using some robust skewness estimators. However, in general, tail heaviness cannot be determined on the basis of a skewness measure. Here the extreme value index (EVI) is used as an indicator of tail heaviness to detect outliers.
Extreme value theory (EVT) is a natural methodology to describe and estimate tail heaviness. EVT starts from considering the limit distribution of the linearly normalized maximum of sequences of independent and identically distributed observations X1, X2, …, Xn:

P((X_{n,n} − b_n)/a_n ≤ y) → G_γ(y), as n → ∞, (1)

for some sequences (a_n > 0) and (b_n), where the generalized extreme value distributions (GEV) with distribution function (df) G_γ are the only possible limit distributions. The GEV family is parameterized by the EVI γ ∈ ℝ. In case γ = 0 the limit GEV is given by the Gumbel distribution with G_0(y) = exp(−exp(−y)), y ∈ ℝ. The set of distributions F for which (1) holds with γ = 0 mainly contains exponentially decreasing tails such as the normal, gamma, Weibull, and lognormal models. The Weibull max-domain corresponding to γ < 0 consists of distributions with a finite endpoint. When γ > 0 the distributions F satisfying the limit result (1), composing the Fréchet domain of attraction, are given by the set of Pareto-type distributions with right tail function (RTF)

1 − F(x) = x^{−1/γ} ℓ(x), (2)

where ℓ is a slowly varying function at infinity, that is, ℓ(ty)/ℓ(t) → 1 as t → ∞, for every y > 1.
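The convergence in (1) can be checked in closed form for a simple Fréchet-domain example. The sketch below (our own illustration, not from the paper) compares the exact df of the normalized maximum of standard Pareto variables with its Fréchet limit:

```python
import math

def max_cdf(y, n):
    """Exact df of M_n / n for the maximum of n i.i.d. standard Pareto
    observations (F(x) = 1 - 1/x, x > 1), i.e. gamma = 1 and a_n = n."""
    return (1.0 - 1.0 / (n * y)) ** n

def frechet(y):
    """Frechet limit G(y) = exp(-1/y), y > 0: the gamma = 1 GEV up to a
    location shift."""
    return math.exp(-1.0 / y)
```

For n = 10^6 the exact df and the limit agree to about five decimal places over a wide range of y, illustrating the limit result (1) for the Fréchet domain.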
Hence the EVI γ is a measure of the tail-heaviness of the distribution of X, with a larger value of γ implying a heavier RTF F̄(x) = 1 − F(x) = P(X > x). While the EVT methods as described above concentrate on the right-hand tail, the method can be extended to the left-hand tail by applying it, for instance, to transformed variables such as 1/X in case of positive data, or to −X in case of negative values to the left of the median. Naturally, detection of possible outliers based on tail information should be based on a robust estimator of the EVI γ. To this end let 0 ≤ X_{1,n} ≤ X_{2,n} ≤ · · · ≤ X_{n,n} denote the ordered observations. In the specific case of Pareto-type observations, the detection of outliers based on a preliminary robust estimate of γ > 0 was proposed in Bhattacharya et al. (2019), using a trimmed version of the Hill estimator in Hill (1975):

H_{k0,k} = (1/(k − k0)) Σ_{j=k0+1}^{k} j (log X_{n−j+1,n} − log X_{n−j,n}), 0 ≤ k0 < k < n.

For Equations (3), (4), and (5) to hold, we must consider k large enough such that X_{n−k,n} > 0. If such a k does not exist, we work with the negatives of the observations to the left of the median as explained above.
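A minimal sketch of the (trimmed) Hill estimator may help fix ideas. The function below is our own illustration based on the log-spacings representation; k0 = 0 recovers the classical Hill estimator, and the form of the trimming follows the description above:

```python
import math, random

def hill(xs, k, k0=0):
    """Trimmed Hill estimator: average of the scaled top log-spacings
    j * (log X_{n-j+1,n} - log X_{n-j,n}) for j = k0+1, ..., k, which for
    Pareto-type tails are approximately i.i.d. exponential with mean gamma.
    k0 = 0 gives the classical Hill estimator."""
    x = sorted(xs)
    n = len(x)
    spacings = [j * (math.log(x[n - j]) - math.log(x[n - j - 1]))
                for j in range(k0 + 1, k + 1)]
    return sum(spacings) / (k - k0)

# quick check on a strict Pareto sample with gamma = 1 (U**-1, U ~ U(0,1))
random.seed(1)
sample = [random.random() ** -1.0 for _ in range(2000)]
gamma_hat = hill(sample, k=200)
```

For this sample, `gamma_hat` lands close to the true value γ = 1, with sampling error of order γ/√k.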
In this paper we adapt the approach from Bhattacharya et al. (2019) to work for a real-valued EVI γ. In Bhattacharya et al. (2019) the number of outliers is estimated using a sequential method identifying the largest value of k0 for which the corresponding test statistic yields a significant value. We construct a new algorithm starting from the statistic E_{k0,k} in (7), with k0 = 0, 1, …, k − 2, which is shown in Theorem 1 to be asymptotically standard exponentially distributed under some regularity conditions. Here, γ on the right-hand side of Equation (7) denotes the true value of the tail index.
As the distribution of E_{k0,k} depends on γ, a starting initial value for γ is needed. Robust estimators for a real-valued EVI were already proposed in Dupuis and Field (1998), Peng and Welsh (2001), and Juarez and Schucany (2004). For our outlier detection procedure, we here propose to use

GH_{m,k} = (1/(k − m)) Σ_{j=m+1}^{k} [log(X_{n−j+1,n} H_{m,j}) − log(X_{n−k+1,n} H_{m,k})], for 0 ≤ m < k ≤ n, (8)

a trimmed version of the generalized Hill estimator GH_k of γ ∈ ℝ given by

GH_k = GH_{0,k} = (1/k) Σ_{j=1}^{k} [log(X_{n−j+1,n} H_{0,j}) − log(X_{n−k+1,n} H_{0,k})], (9)

which was introduced in Beirlant et al. (2005). Imputing the robust estimator γ̂ := GH_{m,k} of γ for a given choice of m and k leads to a pivot Û_{k0,k}, given in (10), which is approximately uniform (0, 1) distributed under the assumption of absence of outliers, whatever the value of γ. Note the continuity in γ of E_{k0,k} at γ = 0. We then flag k0 observations as outliers when observing significantly large values of Û_{k0,k}. In order to illustrate the determination of outliers based on EVT methods, we consider precipitation data from France as considered in Bernard et al. (2013): weekly maxima of hourly precipitation at 92 French stations during the fall season (September–November) from 1993 to 2011, as provided by the French meteorological service Météo-France.
Here we consider the data from Chamonix and Uzein, situated in the South-East and South-West of France, respectively. Bernard et al. (2013) situated the Chamonix case in the Gumbel domain, while the Uzein data appeared to be of Pareto-type.
Concerning the Chamonix precipitation data (N 45.93°, E 6.88°), Bernard et al. (2013) proposed the estimate 0.01 for γ, so that this right-hand tail is situated near the Gumbel domain. The linear fit on the exponential QQ-plot of the Chamonix data given in Figure 1 indeed suggests that the tail behavior is close to exponential. The CL indicates six potential outliers.
In contrast to the Chamonix data, the precipitation at Uzein (N 43.38°, E −0.42°) appears to have a Pareto-type behavior, as concluded already in Bernard et al. (2013) with an EVI estimate of 0.35. This is also confirmed by a linear regression fit on the top 64 points of the Pareto QQ-plot in Figure 2. Here the CL indicates eight possible outliers. These examples will be taken up again in Section 4 below, where we propose a tail-adjusted boxplot in which the whiskers extend from the upper quartile Q3 up to the largest value among all nonoutlying observations, and similarly for the left whisker below Q1. In both cases the proposed algorithm will return no outliers.
More generally, the identification of outliers in a multivariate context often relies on the detection of extreme observations of an associated random variable. Below in Section 5 we will discuss application of the proposed outlier detection algorithm when using local outlier factors (LOFs), a classical method to rank multivariate observations in order to detect multivariate outliers.
The rest of the paper is organized as follows. In Section 2 we develop the algorithm for outlier identification based on extreme value methodology for a real-valued EVI γ. We provide mathematical justification for the proposed procedure under some assumptions on the quantile function of the underlying distributions, which are rather classical in EVT. Proofs are deferred to the Appendix. In Section 3 we report on a simulation study of the univariate detection algorithm using distributions from all max-domains. In Section 4 we discuss univariate case studies, while in Section 5 the application to multivariate outlier detection using LOFs is discussed, including simulations and case studies.

TESTING FOR OUTLIERS UNDER EXTREME VALUE CONDITIONS
As discussed above, the asymptotic standard exponential distribution of E_{k0,k} allows us to construct the outlier detection algorithm. In the next subsection we specify this result, after which we detail the algorithm for the automated selection of the number of outliers k0.

2.1 On the asymptotic distribution of E_{k0,k}

We here consider asymptotic properties as k, n → ∞ with k/n → 0 and k0/k → 0. Let U be the tail quantile function defined as U(x) = F←(1 − x^{−1}), with the quantile function F← defined as the left-continuous inverse of the cumulative distribution function F. The classical second-order regular variation condition on U is given by

lim_{t→∞} [ (U(tx) − U(t))/a(t) − (x^γ − 1)/γ ] / A(t) = H_{γ,ρ}(x), x > 0, (11)

with a(t), A(t) and H_{γ,ρ} defined as in theorem 2.3.3 of de Haan and Ferreira (2006). Given that the Hill statistic is defined in terms of the logarithm of the empirical quantiles, we need the classical regular variation conditions on log U instead. Let ρ ≠ γ, ρ < 0, γ ≠ 0. Then the second-order condition for log U is given by

lim_{t→∞} [ (log U(tx) − log U(t))/q(t) − (x^{γ−} − 1)/γ− ] / Q(t) = H_{γ−,ρ′}(x), x > 0, (12)

where γ− = min(γ, 0), and ρ′ as defined in lemma B.3.16 of de Haan and Ferreira (2006) satisfies ρ′ < 0. The functions q(t) and Q(t) are as defined in section 3.5 of de Haan and Ferreira (2006).
Remark 1. The cases γ < 0, ρ = 0 and γ = 0, ρ ≠ 0 can be mathematically handled using (12). However, since they rarely occur in practice, we do not explicitly deal with their proofs. The case ρ = γ cannot be handled with the second-order condition on log U as stated above.
We can now state the asymptotic distribution of E k 0 ,k .
Theorem 1. Under the above conditions, E_{k0,k} as in (7) converges in distribution to the standard exponential distribution.
The proof of this theorem is provided in Appendix C.
Remark 2. Using the delta method one can show that for any consistent estimator γ̂ of γ, Ê_{k0,k} also converges to the standard exponential distribution. For explicit results on the rate of convergence to the standard exponential distribution as in Theorem 1, one needs third-order regular variation assumptions in addition to (11) or (12).
In the next section, we shall use the trimmed generalized Hill estimator GH_{m,k} as in (8) as a possible choice of the tail index estimator γ̂. Using the methods from Beirlant et al. (2005), GH_{m,k} can be shown to be consistent for γ as k, n → ∞ and m/k → 0. This, however, is not the focus of our current work and can be explored in the future.

2.2 Algorithm for automated selection
We now specify the method for estimating the number of outliers.
(1) Robust estimation of γ. For a given choice of m and k, one can substitute γ ∈ ℝ in (7) by the generalized trimmed Hill estimator GH_{m,k} from (8), which we denote hereafter by γ̂. Other robust estimators of γ ∈ ℝ, such as the estimator proposed in Juarez and Schucany (2004), can also be used here. However, in extensive simulation experiments, the use of GH_{m,k} gave the best outlier detection results.
(2) Computation of the pivot. Based on Theorem 1, the statistic Û_{k0,k} from (10) approximately follows the U(0, 1) distribution. With a given value of k, we flag k0 observations as outliers when observing significantly large values of Û_{k0,k}.
(3) Estimation of k0. For a given value of (m, k) and significance levels α_j as specified below, we obtain an estimate k̂0 of k0 as the largest value of k0 yielding a significant pivot value.
The preceding steps require a choice for m and k. We index these parameters as fractions, up to some maximum values, and we evaluate the estimates k̂0 = k̂0(·, ·) as a function of these two fractions. We finally decide on a final number of outliers through a majority vote principle. Let {f1, f2, …} be the enumeration of the corresponding proportions of occurrence of the different values of k̂0 over this grid; the final estimate k̂*0 is the value of k̂0(·, ·) occurring with the highest proportion. A plot of k̂0(·, ·) as a function of the two fractions shall be referred to as the majority vote plot (13). Extensive sensitivity analysis showed that the choice (0.6, 0.1) for the maximum values works well; this choice is used throughout below. The final estimate k̂*0 is the one which occurs most often in this plot. Figure 3 gives the majority vote plot for varying simulation models; for the definition of these distributions we refer to Section 3. The solutions k̂0^(j) (j = 1, 2, 3) corresponding to the three highest proportions f_j are also listed in decreasing order of f_j. Of course a special notification should be given in the unlikely event that the top two (or more) proportions f_j turn out to be equal, in which case we propose to choose the solution with the smallest number k̂0^(j) of outliers.
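The majority vote step can be sketched as follows (an illustration with our own function name; ties among the most frequent candidates are broken toward the smallest value, as proposed above):

```python
from collections import Counter

def majority_vote(k0_values):
    """Pick the candidate number of outliers occurring most often over the
    (m, k)-grid; in case of a tie among the top proportions, return the
    smallest candidate."""
    counts = Counter(k0_values)
    # maximize the count; among equal counts, prefer the smaller candidate
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
```

For instance, a grid yielding mostly 0 with a secondary cluster at 13 returns 0, mirroring the behavior described for the majority vote plot.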
Choice of α_j. As in Bhattacharya et al. (2019), the levels α_j in the above algorithm are chosen as in (14), with a > 1 and c = 1/Σ_{j=0}^{k−2} a^{k−j−1}. This choice of α_j guarantees ∏_{j=0}^{k−2}(1 − α_j) = 1 − q, which ensures that a test based on k independent U(0, 1) random variables leads to a type 1 error of q (see relation (2.9) in Bhattacharya et al., 2019). In view of Theorem 1, this holds approximately in our setting, so that the algorithm is well calibrated. This choice puts less weight on large values of j, which implies that large values of k̂0 are less likely to be chosen over smaller ones. This guards against encountering spurious values of k̂0 close to k. Extensive sensitivity analysis showed that the choice of levels as in (14) with a = 1.2 works well in practice, also for lighter tails than the Pareto-type models considered in Bhattacharya et al. (2019).
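The product constraint on the levels can be made concrete as follows. Since the exact form of (14) is not reproduced here, the expression for α_j below is our own guess that satisfies the stated constraint exactly; only the geometric weighting with a > 1 and the normalizing constant c are taken from the text:

```python
q, a, k = 0.05, 1.2, 50

# geometric weights a**(k-j-1) for j = 0, ..., k-2, normalized by c
weights = [a ** (k - j - 1) for j in range(k - 1)]
c = 1.0 / sum(weights)

# assumed form of (14): alpha_j = 1 - (1-q)**(c * a**(k-j-1)), so that the
# product of (1 - alpha_j) over j equals exactly 1 - q
alphas = [1.0 - (1.0 - q) ** (c * w) for w in weights]
```

With a = 1.2 the levels decrease geometrically in j, so late (large-j) rejections are penalized, in line with the calibration argument above.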
This then leads to the outlier detection Algorithm 1. The algorithm presented in Bhattacharya et al. (2019) can also be adapted using a majority vote decision principle; this adapted algorithm is provided in Appendix A.
In all simulations and case studies presented in Sections 3 and 4 the parameter q was set to 0.05.

SIMULATIONS
In this section, we study the accuracy of the DAST (domain adapted sequential testing) algorithm as given in Algorithm 1 above as an estimator of the true number of outliers k*0. The performance of k̂*0 is evaluated in terms of its expected value E(k̂*0) and standard deviation SD = √Var(k̂*0). These results are based on 2500 independent Monte Carlo simulations. We generated n = 1000 i.i.d. observations from the following distributions with γ > 0, γ = 0 and γ < 0, respectively.
3. Case γ < 0.
Outlier scenarios: In Sections 3.1 (γ > 0), 3.2 (γ = 0) and 3.3 (γ < 0), outliers are introduced in the extreme observations by perturbing the top k*0 order statistics in two ways: exponentiated outliers, as in (15), and scaled outliers, as in (16). Note that the correct number of outliers is denoted by k*0 when L ≠ 1 and C ≠ 1. The constants L and C control the intensity of the injected outliers: whereas L < 1 and C < 1 shrink the top k*0 order statistics, the scenarios L > 1 and C > 1 inflate the top k*0 values. The case L = 1 and C = 1 corresponds to the regime of no outliers. All results are plotted on the same vertical scale to allow comparisons between the different cases. The mean squared error (MSE) results are also summarized in Tables B1 and B2 of Appendix B.
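The two perturbation schemes can be sketched as follows. The exact forms of (15) and (16) are not reproduced in the text above, so the power/scale transforms below are assumptions, not quotations:

```python
def inject_outliers(xs, k0, L=1.0, C=1.0):
    """Perturb the k0 largest order statistics: raise them to the power L
    ('exponentiated' outliers) and multiply them by C ('scaled' outliers).
    These exact forms of (15)/(16) are assumed; L = C = 1 leaves the
    sample unchanged."""
    x = sorted(xs)
    if k0 == 0:
        return list(x)
    return x[:-k0] + [C * v ** L for v in x[-k0:]]
```

For example, scaling the top two order statistics of [1, 2, 3, 4] by C = 10 yields [1, 2, 30, 40], so C > 1 indeed inflates the top k*0 values.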

3.1 Case γ > 0

Here we generate data from the Burr(1, 0.5, 1/0.5) distribution with tail index γ = 0.5. For comparison, we also provide the results from the majority vote modification of the WST (weighted sequential testing) algorithm in Bhattacharya et al. (2019) (see Algorithm 2 in Appendix A).
In Figure 4 the performance of the DAST (denoted as EV) and modified WST (denoted as EV+) algorithms is given for varying intensities of the exponentiated and scaled outliers according to (15) and (16), respectively. The true number of outliers (denoted as T) is indicated with the horizontal line. The results of the detection rule used in the CL are not taken up in the figures, as they fall outside the plotted range.
The performance of DAST and the modified WST algorithm is comparable, in spite of the fact that DAST is constructed to handle all domains of attraction, in contrast to the WST algorithm.
As L and C deviate from 1, the performance of the algorithm improves and the best results are obtained at the lowest and largest values of L and C for all values of k * 0 , with improved performance in the region L, C > 1. The results deteriorate with increasing k * 0 . The performance is superior with scaled outliers since the chosen C values produce more pronounced outliers.

3.2 Case γ = 0
Here data are generated from the |N(0, 1)|, Lognormal(0, 1) and Weibull(1, 0.5) distributions, and the results are given in Figures 5, 6, and 7, respectively. We compare the results of the DAST algorithm (EV) with the detection rule used in the CL, and with a modification (E0) of the DAST algorithm in which γ̂ is replaced by the true value of γ. For all distributions the CL rule performs poorly and overestimates the number of outliers at small k*0 values. In the lognormal and Weibull cases, the overestimation is such that the k̂*0 values fall outside the range of the plot. Indeed, for the lognormal and Weibull distributions, DAST outperforms the CL in all cases. The performance of DAST and the modified DAST is more or less similar.
In the |N(0, 1)| case at L = 10 and C = 10, DAST does a better job in terms of both bias and SD than the modified DAST, which uses the correct value of γ. On the other hand, at L = 3, DAST does better than its modified version in terms of bias but performs worse in terms of SD. Overall, DAST outperforms the modified DAST in the regime L, C > 1, while the opposite holds in the regime L, C < 1.

3.3 Case γ < 0
Here, we use the Beta(1, 2) and Reverse Burr(1, 0.5, −0.5) distributions, both with EVI γ = −0.5. Similar to Section 3.2, in Figures 8 and 9 we compare the results of DAST (EV) with the CL and the known-γ modification (E0) of the DAST algorithm. Note that for both distributions the CL fails to capture downscaled outliers (i.e., when L, C < 1). In case of the Beta distribution, the CL outperforms the DAST at k*0 = 20 and L = 3, 10 and C = 10. For the same settings, the modified DAST slightly outperforms the DAST. For L = 3, DAST does a better job in terms of both bias and SD than the modified DAST, which assumes the true γ to be known. At L = 10, DAST outperforms in terms of bias; however, in terms of SD, the modified DAST outperforms DAST at k*0 = 10, while the opposite happens at k*0 = 20. In the Reverse Burr(1, 0.5, −0.5) case, DAST outperforms the CL in terms of bias but has a larger SD, and the performance of DAST and the modified DAST is more or less similar. Further, the DAST results in the Reverse Burr case in general compare well with those of the corresponding Burr setting.
Overall we can conclude that, while the DAST algorithm is designed to accommodate distributions in all domains of attraction, it compares favorably with the competing WST algorithm, which was designed for γ > 0 only. Also, the algorithm in general does not suffer from the need to use a robust estimator of γ, when compared with the same approach where the exact value of γ is imputed.

3.4 Outliers in a mixture distribution
Here, we generate n = 1000 observations from |N(0, 1)| and replace p% of the observations by injected Pareto-type outliers whose tail index γ and scale factor c are varied. In Figure 10, we compare the results of DAST (EV) with the CL and the known-γ modification (E0) of the DAST algorithm for varying values of γ and c. With increasing γ, the tail heaviness of the outliers compared to |N(0, 1)| increases. With increasing c, the strength of the injected outliers further increases. The notation p̂ is used to indicate the estimated proportion of outliers.
At 2% injected outliers, DAST outperforms the CL at all values of γ. At 5% injected outliers, the performance of DAST is poor at γ = 0.5. However, at large c = 200, DAST improves and nearly recovers the true proportion. Similarly, when the tail index increases, the performance of DAST improves and almost recovers the true proportion at γ = 10. The same phenomenon is also observed at 10% injected outliers. These observations can be explained by the fact that at 2% of mixing, the injected observations are indeed outliers, whereas at 10% mixing with small c and small γ, we start dealing with a mixture of two distributions and not necessarily outliers. DAST, which was developed to identify outliers based on tail differences, is not well equipped to handle generic mixture distributions. The MSE results are also summarized in Table B3 of Appendix B.
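The mixture scenario can be mimicked as follows. This is a sketch under our own assumptions about the form of the injected contamination (scaled inverse-power transforms of uniforms, which have EVI γ); the paper's exact construction is not quoted above:

```python
import random

def contaminate(n, p, gamma, c, seed=0):
    """Generate n observations from |N(0, 1)| and replace a fraction p of
    them by c * U**(-gamma) with U ~ U(0, 1), i.e. scaled Pareto-type
    draws with EVI gamma (assumed form, for illustration only)."""
    rng = random.Random(seed)
    xs = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]
    m = int(p * n)
    for i in range(m):
        xs[i] = c * rng.random() ** -gamma  # U in (0, 1), so value >= c
    return xs
```

With p = 0.02, γ = 2 and c = 200, the 20 injected points all exceed 200 and are therefore unambiguous outliers relative to the half-normal bulk, matching the regime where DAST recovers the true proportion.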

TAIL-ADJUSTED BOXPLOT AND CASE STUDIES
In this section, we apply the DAST algorithm to detect outliers in some cases that exhibit deviating data at one or both tails, and that appear to belong to different max-domains. We present the results using a tail-adjusted boxplot: the DAST algorithm is applied to obtain the right tail outliers, while the bottom outliers are determined by applying the DAST algorithm to the right tail of the transformed random variable 1/X in case of positive data, or −X in case of negative left tail data. The whiskers of the tail-adjusted boxplot extend from the upper, respectively lower, quartile to the most extreme observations just below, respectively above, the set of identified outliers. In order to break ties in X before applying DAST, each point X_i is rescaled to X_i/10^{⌊log10 X_{n,n}⌋} and then dithered by adding a small noise from U(−0.01, 0.01).
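The whisker construction reduces to simple order-statistics bookkeeping once the numbers of top and bottom outliers have been estimated. The helper below is our own illustration (the outlier counts are inputs here; the DAST detection itself is not re-implemented):

```python
def tail_adjusted_whiskers(xs, n_upper, n_lower):
    """Whisker endpoints of the tail-adjusted boxplot: the most extreme
    observations just inside the sets of n_upper top and n_lower bottom
    outliers."""
    x = sorted(xs)
    upper = x[-(n_upper + 1)] if n_upper > 0 else x[-1]
    lower = x[n_lower] if n_lower > 0 else x[0]
    return lower, upper
```

For example, with data [1, 2, 3, 5, 100] and one detected top outlier, the whiskers run from 1 to 5, excluding the flagged value 100.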

French precipitation dataset
Continuing the analysis of the precipitation datasets from Chamonix and Uzein started in Section 1, we here apply the DAST algorithm to identify top outliers. For both Chamonix and Uzein, the majority vote plots (13) convincingly indicate 0 outliers (see Figure 11), in contrast to the six and eight outliers, respectively, indicated by the CL (see Figures 1 and 2). In case of the Uzein data, note that the second largest proportion of k̂0 values corresponds to 13 outliers. These 13 points indeed exhibit some deviating behavior in the Pareto QQ-plot in Figure 2. While this is interesting to note from a data-analytic point of view, this choice is only supported in second order by the DAST algorithm.

Auto dataset
The Auto dataset, available at https://archive.ics.uci.edu/ml/datasets/auto+mpg, lists the weights of different car types. This variable was one of the car characteristics used for the prediction of fuel consumption in miles per gallon (mpg) in Quinlan (1993). The top and bottom outliers are identified by applying the DAST to the data and to the inverse of the data, respectively. The CL identifies no outliers, whereas the tail-adjusted boxplot (see Figure 12) identifies six top and two bottom outliers based on the majority vote plots for both tails in Figure 13. Both the top and the bottom tail of this example lie in the Weibull domain (γ < 0), which is confirmed by the generalized QQ-plots in Figure 13, which exhibit a decreasing pattern leading to negative GH_k values as defined in (9).

Condroz dataset
The Condroz dataset with calcium content measurements together with the pH level of soil samples in the Condroz region of Belgium was discussed in detail in Goegebeur et al. (2005), and has been analyzed for outliers in Vandewalle et al. (2007), Hubert and Vandervieren (2008), and Bhattacharya et al. (2019) (Figure 14). As in these references, we consider the conditional distribution of the calcium content for pH levels lying between 7 and 7.5, leading to n = 420 data values. Applying the DAST algorithm to the data and the inverse of the data, respectively, we identify six top and three bottom outliers (see Figure 15), which are indeed separated from the main bulk of the data in the time plot and the Pareto QQ-plot.

MULTIVARIATE OUTLIER DETECTION
In this section we consider the application of DAST to the problem of detecting outliers in multivariate data. This application hinges on the fact that most outlier detection methods assign scores to observations, which are then used to rank the observations in decreasing order of the likelihood of an observation being an outlier. We summarize some of these methods here. Density-based outlier detection methods assign scores to observations using local density estimates. Examples include LOF (Breunig et al., 2000), connectivity-based outlier factors (Tang et al., 2002), local correlation integrals (Kriegel et al., 2009), influenced outlierness (Jin et al., 2006), and local outlier probabilities (Kriegel et al., 2009); more can be found in Madsen (2018). There are several clustering-based methods that assign outlier scores to observations. For example, Ramaswamy et al. (2000) proposed applying the kNN algorithm to a dataset and using the distance from an observation to its closest centroid as its outlier score. DBSCAN (Ester et al., 1996) is a popular density-based clustering method that can identify observations that do not belong to any particular cluster when constructing its clustering solution. As noted by Campello et al. (2015), there are criticisms against labeling these "noise" observations as outliers. Instead, the global-local outlier score from hierarchies (GLOSH) was developed from a hierarchical implementation of an augmented version of DBSCAN. The isolation forest (Liu et al., 2008) can also be viewed as clustering-based, since it performs multiple segmentations of the data in the form of binary trees. The binary trees are fitted to subsamples of the available data in an unsupervised manner, where random splits are performed on the available features. The algorithm is based on the idea that outliers are isolated into singleton nodes at low depths of these trees. All the binary trees are combined to produce a single, more robust, outlier score.
After a ranking has been obtained, one could reject the top 5% (say) of observations as outliers, or resort to some heuristic cut-off value for the method under consideration. Instead, one could apply DAST to these scores in order to automatically estimate the number of outliers present in the data. As a proof of concept we apply DAST to the LOF method, since (a) it is a classical and well-known outlier detection method and (b) there is a rule-of-thumb cut-off value we can use for comparison. We also consider an approach based on the DBSCAN algorithm for comparison.

Local outlier factors
We provide an informal discussion of the LOF method and refer to Breunig et al. (2000) for a more detailed description. Consider a set of data D = {x_i : i = 1, 2, …, n} and let k ≥ 1 be an integer. For the purpose of this paper, we measure the distance between vectors by the Euclidean distance, which we denote by d(·, ·).
Here card(·) denotes the cardinality of a set. Essentially, the k-distance is the distance from an observation x to its k-th nearest neighbor, and we define the k-neighborhood N_k(x) of x as the set of observations in D whose distance to x does not exceed the k-distance of x. The k-distance is used to define the reachability distance of an observation x ∈ D with respect to another observation y as reach-dist_k(x, y) = max{k-distance(y), d(x, y)}. The local reachability density lrd(x) of x is then defined as the reciprocal of the average reachability distance of x with respect to its neighbors; a high value of lrd(x) indicates that x is very reachable from its neighbors. Finally, the LOF of an observation x is given by the average of the ratios lrd(y)/lrd(x) over the neighbors y ∈ N_k(x).

TABLE 1. Scenarios for the simulation studies of Section 5.2. Notes: The average numbers of observations labeled as outliers, using (17), across the simulations are indicated in the right side of the table. In the simulations ε = 0.02 was used.
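An illustrative O(n²) implementation of these quantities may be useful as a reference; this is our own sketch of the standard LOF construction, not the implementation used in the paper (which relies on Hahsler et al., 2019):

```python
import math

def lof_scores(points, k):
    """LOF scores (Breunig et al., 2000) for a small dataset, using
    Euclidean distance; ties at the k-distance are all included in the
    neighborhood, as in the original definition."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # k-distance and k-neighborhood of each point
    kdist, neigh = [], []
    for i in range(n):
        others = sorted((d[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        kdist.append(kd)
        neigh.append([j for dist, j in others if dist <= kd])
    # local reachability density: reciprocal of average reachability distance
    def lrd(i):
        reach = [max(kdist[j], d[i][j]) for j in neigh[i]]
        return len(reach) / sum(reach)
    lrds = [lrd(i) for i in range(n)]
    # LOF: average ratio of neighbors' lrd to own lrd
    return [sum(lrds[j] for j in neigh[i]) / (len(neigh[i]) * lrds[i])
            for i in range(n)]
```

On a tight cluster plus one far-away point, the cluster members get LOF scores near 1 while the isolated point gets a large score, in line with the rule of thumb discussed below.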
Intuitively, a high LOF value means that the neighbors of an observation tend to be more reachable than the observation itself. This suggests that the local density of the observation is low and that the observation is potentially an outlier. In Breunig et al. (2000) it is argued that if an observation lies deep inside a cluster (or close to a mode of the underlying distribution) then its LOF score is likely to be close to 1, which motivates the rule of thumb frequently encountered in the literature of checking observations with an LOF score exceeding 1 as potential outliers. In our experiments this rule of thumb was used as a baseline. In all experiments, LOF scores were computed using the implementation in Hahsler et al. (2019). Table 1 contains three different scenarios for simulating reference and outlying observations. We use the notation f ∼ N(μ, Σ) to indicate that f is the density function of a multivariate normal distribution with mean vector μ and covariance matrix Σ. Similarly, f ∼ t_v(μ, Σ) indicates that f is the density function of a multivariate t distribution with v degrees of freedom, mean vector μ and scale matrix Σ. For every scenario, observations are generated according to two specified density functions f_0 and f_1, with ε denoting the sampling fraction from f_1. In all cases, f_0 and f_1 have the same location parameter, but may differ in terms of the scale matrix and/or tail-heaviness. Differences in the scale matrices are regulated by a scaling parameter c which is taken from {3, 5, 7, 9}, while differences in tail-heaviness are achieved by varying the degrees of freedom of the multivariate t distribution. An observation x is labeled an outlier if condition (17) holds, and an inlier otherwise.

Simulations

The reason for labeling observations this way is to obtain a more concise definition of an outlier, since some observations generated according to f_1 can appear to come from f_0 (or vice versa) due to random chance. To illustrate this labeling approach, a sample of size 100 of dimension 2 is generated according to Scenario 1 with c = 3 and ε = 0.2. One would expect observations closer to the origin to have been generated from f_0 rather than f_1. The left part of Figure 16 displays this sample, with the black circles and red crosses indicating observations from f_0 and f_1, respectively. There are cases where observations generated from f_1 appear to have been generated from f_0, and it would not make sense to label these as outliers. Conversely, some observations generated according to f_0 appear to have been generated according to f_1, and these should not be labeled as inliers. The middle part of Figure 16 displays the same sample, but with the black circles and red pluses now indicating observations labeled as inliers and outliers, respectively, using (17).

Figure 16. Left and middle: illustration of the effect of using Equation (17) to label outliers. Left: before application of (17). Middle: after application of (17). Right: example of a simulated dataset where DAST LOF detected no outliers for all neighborhood sizes. Red crosses indicate observations labeled as outliers.

For every scenario, 100 contaminated datasets are generated for each c ∈ {3, 5, 7, 9}, where each dataset consists of 980 observations generated according to f_0 and 20 according to f_1, that is, ε = 0.02. Observations are labeled as inliers/outliers using (17) and, since f_0 and f_1 depend on the chosen scenario and value of c, there is a natural variation in the number of labeled outliers contained in a contaminated dataset. Table 1 contains information about this variation, by reporting the average number of observations labeled as outliers as a function of the scenario and the value of c.
For every contaminated dataset, the LOF scores corresponding to neighborhoods of size k ∈ {100, 110, 120, … , 890, 900} are determined. DBSCAN requires the specification of two parameters, namely the minimum neighborhood size and the neighborhood radius. The former has an interpretation similar to that of k in LOF, and so we use {100, 110, 120, … , 890, 900} for this parameter as well. Hahsler et al. (2019) recommend setting the neighborhood radius where the knee occurs in a plot of the sorted nearest neighbor distances corresponding to k − 1 neighbors. Since we perform multiple simulations across a variety of settings, we need an automated method of selecting the neighborhood radius from the nearest neighbor distance plot. For this purpose we use the kneedle algorithm of Satopaa et al. (2011) as implemented in Tam (2022). To measure the accuracy of a proposed classification of inliers/outliers we use the F1 score. Let tp and fp be the numbers of observations correctly and incorrectly classified as outliers, respectively, and tn and fn the same but for inliers. The precision and recall are defined as tp∕(tp + fp) and tp∕(tp + fn), respectively, and the F1 score is their harmonic mean, 2 × precision × recall∕(precision + recall). For each k ∈ {100, 110, 120, … , 890, 900}, four methods are used to obtain a proposed inlier/outlier classification of the observations: the baseline, best F1, and DAST LOF methods applied to the LOF scores, and DBSCAN, where the noise observations identified for a given minimum neighborhood size, with the neighborhood radius selected by the kneedle algorithm, are labeled as outliers.
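The precision, recall, and F1 computations described above can be sketched in a few lines; this is a minimal illustration, with the function name and the boolean-list representation of the labels chosen purely for exposition.

```python
def f1_score(true_outlier, pred_outlier):
    """F1 score of a proposed inlier/outlier classification.

    Both arguments are equal-length boolean lists with True = outlier."""
    tp = sum(t and p for t, p in zip(true_outlier, pred_outlier))
    fp = sum((not t) and p for t, p in zip(true_outlier, pred_outlier))
    fn = sum(t and (not p) for t, p in zip(true_outlier, pred_outlier))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:  # no true positives at all
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For instance, a method that finds one of two true outliers while also flagging one inlier has precision = recall = 1/2 and hence an F1 score of 1/2.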
The results of the simulation study are shown in Figure 17. The black solid curve shows the average of the F1 scores obtained over the 100 datasets, using the best F1 method, as a function of the LOF neighborhood size. This serves as an upper bound on the performance of any cut-off selection method applied to LOF, since it is determined using the actual labels, which are not available in practice. The red dashed, blue dotted, and green double-dashed lines show the same statistic for the baseline, DAST LOF, and DBSCAN methods, respectively.
The baseline LOF method performs poorly across all the simulations because it rejects too many observations as outliers, that is, its precision values are close to zero. DAST LOF consistently outperforms the baseline method, and its performance improves as c increases. It achieves near optimal performance in Scenario 2, where the difference between the tail indices of f 0 and f 1 is the largest among all scenarios considered. In contrast to baseline LOF, DAST LOF achieves performance comparable to DBSCAN in Scenarios 1 and 3, and outperforms the latter in Scenario 2.
In some of the scenarios (Scenario 1 with c = 3, for example) DAST LOF frequently detects no outliers in the simulated datasets. One such example is given in the right panel of Figure 16. There is one outlier here according to (17), indicated by the red cross, and DAST LOF found no outliers for all values of k considered. In such situations we allocate an F1 score of zero, a heavy penalization since the red cross is close to the border of the other observations. The reason why DAST LOF does not find this observation to be an outlier is that its LOF score does not manifest as being of a different tail index than those of the inliers. One would expect DAST LOF to do well when observations labeled as outliers have LOF scores with a different tail index than the LOF scores of observations labeled as inliers. If the LOF scores of outlying observations do not manifest in this way, it is not reasonable to expect DAST LOF to detect them, and another method, such as DBSCAN, may be better suited to this task. It is often worthwhile to combine the results of outlier detection methods to obtain improved results. We experiment with a simple method that uses the DAST LOF inlier/outlier classification when it is able to find at least one outlier, and the DBSCAN classification otherwise; we do this because we found DAST LOF to perform well whenever it detects at least one outlier. This method is labeled DAST & DBSCAN, and its average F1 scores are traced by the purple dashed-dotted curve in Figure 17. The DAST & DBSCAN method compares favorably to the DAST LOF and DBSCAN methods in Scenario 1. DAST & DBSCAN is identical to DAST LOF in Scenario 2, because DAST LOF almost always detected at least one outlier in these simulations. In Scenario 3, DAST & DBSCAN outperforms DAST LOF and compares well with DBSCAN over a wide range of k values.

TABLE 2 Summary statistics of the iris, breast cancer, and ionosphere datasets.
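The fallback rule just described is simple enough to state directly; a sketch, assuming both classifications are given as boolean lists with True marking an outlier:

```python
def dast_and_dbscan(dast_labels, dbscan_labels):
    """Combined classification: keep the DAST LOF labels if they flag
    at least one outlier, otherwise fall back to the DBSCAN labels."""
    return list(dast_labels) if any(dast_labels) else list(dbscan_labels)
```

The design choice is deliberately asymmetric: DAST LOF is trusted whenever it fires at all, and DBSCAN is consulted only in the all-inlier case where DAST LOF is known to be unreliable.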
In Scenario 3 it seems that the LOF scores of observations labeled as outliers sometimes have a tail index different from those of the inliers and sometimes not.

Case studies
Outlier detection methods are frequently evaluated using classification datasets, where the idea is to select a subset of the classes to represent outliers. We follow this convention by considering the iris, ionosphere, and Wisconsin breast cancer diagnostic classification datasets. All these datasets are available in the UCI machine learning repository (Dua & Graff, 2019). Some summary statistics of the datasets can be found in Table 2. For the breast cancer dataset, the identification variable is excluded from the analysis and observations with missing values are removed. Two predictor variables of the ionosphere dataset are excluded, one binary and one constant across all observations. Using classification datasets for benchmarking outlier detection methods comes with its own issues and drawbacks. We consider these briefly here, and refer to Steinbuss and Böhm (2021) for further details. The first issue is the selection of the classes to represent outliers. The PCA plot of the iris data in Figure 18 reveals that the setosa iris species differs markedly from the versicolor and virginica species, while there is some degree of overlap between the latter two. Hence we use the setosa species to represent outliers, and the versicolor and virginica species to represent inliers. The ionosphere and breast cancer datasets have only two classes each, and for both we take the under-represented class to represent outliers, that is, malignant masses for breast cancer and bad radar returns for ionosphere. Note that the classes chosen to represent outliers may constitute a significant mode in the data. For instance, in the case of the iris data, the setosa species comprises 33% of the data, and we would not generally expect such a large proportion of outliers in a dataset. We can regulate the number of observations labeled as outliers by taking subsamples from the outlying class. In this way we can also investigate the sensitivity of methods to the number of outliers present in the data.

FIGURE 18 Plots of the first two principal components for the datasets under consideration. Left: Black circles, red pluses, and green crosses represent the setosa, versicolor, and virginica species from the iris data. Middle: Black circles and red pluses represent the benign and malignant masses from the breast cancer data. Right: Black circles and red pluses represent the good and bad radar returns from the ionosphere data.
One of the major drawbacks of using classification datasets when benchmarking outlier detection methods is that there can be significant overlap between the classes chosen to represent inliers and outliers (see the ionosphere data in Figure 18, for example). In such cases, a subsample taken from the outlying class can very well be indistinguishable from the inlier class, and constructing a dataset by adding this subsample to the inlier class, with the former labeled as outliers, would lead to a distorted benchmark. In addition, we do not have access to the data generating mechanism, so we cannot apply a rule such as (17) to label outliers/inliers (in principle, we could fit a classifier to the data, but this introduces additional variation into the benchmarking process). This drawback can be mitigated to some degree by also taking a subsample from the inlier class. In our case studies, a benchmark subsample is constructed by taking a subsample from both the outlier and inlier classes, with the subsample from the former labeled as outliers. Even with this subsampling approach, there can be cases where the observations labeled as inliers overlap significantly with those labeled as outliers. This is mitigated to a degree by constructing multiple benchmark subsamples through repeated random subsampling as described above, and applying a method to each of these datasets to obtain an aggregate benchmark.
The preceding discussion motivates the following benchmark design. For each of the iris, breast cancer, and ionosphere datasets, we generate three sets of benchmark subsamples by taking 100 random subsamples of sizes 3, 10, and 20 from the outlier class. Every random subsample from the outlying class is accompanied by a random subsample from the inlier class, whose size is chosen such that any benchmark subsample generated from iris, breast cancer, or ionosphere is of size 100, 300, or 200, respectively. These sizes were selected proportional to the sizes of the inlier classes. The observations sampled from the inlier and outlier classes are then labeled as inliers and outliers, respectively.
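The construction of a single benchmark subsample can be sketched as follows; the function and parameter names are illustrative, and the fixed-total design (a handful of outliers topped up with inliers) follows the description above.

```python
import random

def benchmark_subsample(inlier_class, outlier_class, n_outliers,
                        total_size, seed=0):
    """Draw n_outliers observations at random from the outlier class and
    total_size - n_outliers from the inlier class; return the pooled
    observations together with boolean labels (True = outlier)."""
    rng = random.Random(seed)
    out = rng.sample(outlier_class, n_outliers)
    inl = rng.sample(inlier_class, total_size - n_outliers)
    observations = inl + out
    labels = [False] * len(inl) + [True] * len(out)
    return observations, labels
```

Repeating this call with different seeds yields the 100 random benchmark subsamples per setting used in the case studies.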
For every benchmark subsample, the LOF scores and the DBSCAN clustering with the kneedle algorithm are computed for k ∈ K = {0.1n, 0.11n, … , 0.9n}, with n the size of the benchmark subsample (a multiple of 100). For each set of LOF scores, outliers are detected using the baseline, best F1, and DAST LOF methods as discussed in Section 5.2, and F1 scores are computed. For DBSCAN, the noise observations are again labeled as outliers, and the DAST & DBSCAN classification is determined as in Section 5.2.
The performance of a method for a specific dataset (iris, breast cancer, ionosphere), neighborhood size, and number of outliers is measured by the average of the F1 scores obtained over all the associated benchmark subsamples. We can plot these average F1 scores as a function of the neighborhood size in order to compare the performance of the different methods for a particular dataset and number of outliers. Although these plots are informative, it is also important to consider how to obtain a single inlier/outlier classification based on the results obtained over the grid. For LOF, a simple approach is to summarize the scores obtained for a particular observation over the grid of neighborhood sizes into a single statistic, and then apply the baseline, best F1, or DAST method to these summarized values to detect outliers. In our case studies we used the maximum statistic, as advocated by Breunig et al. (2000); more specifically, the maximum LOF score for x i is computed as max k∈K LOF k (x i ). For DBSCAN, an observation is labeled an outlier if it is flagged as noise at least 50% of the time over the grid K. DAST & DBSCAN takes the classification obtained from DAST applied to the maximum LOF scores if it is able to find at least one outlier, and the DBSCAN classification otherwise. For a particular dataset and number of outliers, these grid-free methods are applied to all the corresponding benchmark subsamples and the average of the F1 scores is used to measure performance.
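The two grid-free aggregation rules above (the maximum LOF score over the grid of neighborhood sizes, and the 50% noise vote for DBSCAN) can be sketched as follows; the dictionary-of-lists representation of the per-k results is an assumption made for illustration.

```python
def max_lof_scores(scores_by_k):
    """scores_by_k maps each neighborhood size k to a list of LOF scores,
    one per observation; return each observation's maximum score over k."""
    per_k = list(scores_by_k.values())
    # zip(*per_k) iterates over observations, collecting scores across k.
    return [max(obs_scores) for obs_scores in zip(*per_k)]

def dbscan_majority(noise_by_k):
    """noise_by_k maps each k to boolean noise flags, one per observation;
    flag an observation as an outlier if it is noise at least 50% of the
    time over the grid."""
    per_k = list(noise_by_k.values())
    n_grid = len(per_k)
    return [sum(flags) >= 0.5 * n_grid for flags in zip(*per_k)]
```

The maximum statistic then feeds into the baseline, best F1, or DAST cut-off rules exactly as a single set of LOF scores would.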
In the following figures the solid black, dashed red, dotted blue, double-dashed green, and dotted-dashed purple lines always represent the best F1, baseline, DAST LOF, DBSCAN, and DAST & DBSCAN methods, respectively.

Iris data
The results of applying the benchmark process to the iris data are shown in Figure 19. We see that DAST LOF outperforms the baseline when 3 and 10 outliers are sampled from the setosa species. When 20 outliers are sampled, the performance of DAST LOF and the baseline method is more similar than in the other cases. One potential reason is that 20 outliers constitute 20% of the data, a sizeable proportion for outliers. We also observe that the performance of DAST LOF generally deteriorates for larger neighborhood sizes, because LOF scores tend to become more similar across observations as the neighborhood size grows. DAST & DBSCAN improves on DAST LOF, while DBSCAN performs best over wide ranges of k across all scenarios. The results of the grid-free methods are summarized in Table 3 and show findings similar to those from Figure 19. The performance of DBSCAN in the grid-free setting is particularly good.

Wisconsin breast cancer diagnostic data
The results from applying the benchmark process to the breast cancer data are shown in Figure 20. We observe that DAST LOF outperforms the baseline for all numbers of outliers considered, while it performs better than DBSCAN for 3 and 10 outliers, and marginally better for 20 outliers for most k. The performance of DAST & DBSCAN and DAST LOF is similar for 3 and 10 outliers; the former performs better with 20 outliers. The results of the grid-free methods are summarized in Table 3 and show findings similar to those from Figure 20. The performance of DAST & DBSCAN is the best among all methods (except best F1) and scenarios in the grid-free setting.

TABLE 3 Average F1 scores achieved by the baseline, best F1, local outlier factor domain adapted sequential testing (DAST), DBSCAN, and DAST & DBSCAN methods for the ionosphere data.

Ionosphere data
The results for the ionosphere data are shown in Figure 21. The performance of the DBSCAN and baseline methods is relatively similar, and DAST LOF compares well only for three outliers (Table 4). We see that DAST & DBSCAN outperforms DBSCAN over a wide range of k values, and the baseline for almost all k. The DAST & DBSCAN method performs best in the grid-free setting among all cases and methods, except for best F1.

Summary
In the simulations and case studies we saw that DAST LOF tends to yield superior F1 scores compared to the baseline method. This serves as proof of concept that DAST LOF can be used in conjunction with outlier scoring methods to recommend a precise cut-off point for outlier identification. We argued that DAST LOF will do well when the LOF scores of outlying observations differ in tail index from those of the inliers; otherwise it would be better to use a method such as DBSCAN.

CONCLUSION
In this paper we provided a testing procedure for outlier detection based on extreme value methodology. The test statistic is based on the deviations of trimmed Hill statistics when trimming consecutive extreme data points. While the Hill estimator is a consistent estimator only in the case of a positive EVI, we show that this statistic is still useful for outlier detection in all max-domains of attraction. As a practical consequence a tail-adjusted boxplot is proposed, allowing one to indicate possible outliers depending on the tail heaviness of the underlying distribution. A proof of concept was provided that this tail-adjusted boxplot can be applied to outlier scores for the purpose of detecting outliers in multivariate data, by applying it to local outlier factors. In Section 5 a variety of alternatives to the LOF scoring method were discussed. A comprehensive study of the use of these methods in conjunction with DAST is an important avenue for further research.

REFERENCES
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Vandewalle, B., Beirlant, J., Christmann, A., & Hubert, M. (2007).

1: Input * and * .
2: For a given and , set k = ⌊n ⌋.
3: Set j ∈ (0, 1), j = 0, 1, … , k − 2, satisfying ∏ k−2 j=0 (1 − j ) = 1 − q.
4: Compute U j,k , where T j,k is defined in (6).

• when γ < 0, where Y 1,n ≤ ⋯ ≤ Y n,n denote the order statistics of an i.i.d. sample from the standard Pareto distribution and Z 1 , Z 2 , … an i.i.d. standard exponential sequence, with both sequences linked through the Rényi representation log Y n−i,n = ∑ n j=i+1 Z j ∕j, i = 0, 1, … , n − 1.

To prove Proposition 1, we need a lemma concerning (k 0 ∕k) g(Y n−k 0 ,n ∕Y n−k,n ) + (1∕k) ∑ k−1 i=k 0 g(Y n−i,n ∕Y n−k,n ) for functions g(x) = x u , u < 0, where Y 1,n ≤ ⋯ ≤ Y n,n denote the order statistics of a sequence Y 1 , … , Y n of i.i.d. standard Pareto random variables, that is, with both parameters in (2) set equal to 1.

Lemma 1. For any u < 0 we have that, as k, n → ∞, k∕n → 0 and k 0 = o(k),

Proof.
For simplicity, we consider the two cases k 0 = k 0 (n) ≤ M and k 0 = k 0 (n) → ∞. For any generic sequence k 0 = k 0 (n), we can mimic the arguments in Lemma C.9 of Bhattacharya et al. (2019).
Since Y n−k,n converges in probability to ∞, we use (C4) with t = Y n−k,n and x = Y n−i,n ∕Y n−k,n , so that with probability tending to 1. Since Y n−k,n ∕(n∕k) converges in probability to 1 and Q 0 is regularly varying with index ′ , we have

Again using the Rényi representation gives
A representation of the denominator of the right-hand side of (C11) is given in Proposition 1. We now also develop the numerator V k 0 +1 in terms of the Z sequence of exponential random variables.