Statistical power estimation dataset for external validation GoF tests on EVT distribution

This paper presents the statistical power estimation of goodness-of-fit tests for Extreme Value Theory (EVT) distributions. The dataset provides quantitative information on statistical power, enabling sample size selection in external validation scenarios. In particular, high-precision estimates of the statistical power of the KS, AD, and MAD goodness-of-fit tests have been computed using a Monte Carlo approach. The full raw dataset resulting from this analysis has been published as a reference for future studies: https://doi.org/10.17632/hh2byrbbmf.1.



Data
The dataset described in this paper provides an estimate of the statistical power of Goodness-of-Fit (GoF) tests. The analytical calculation of this power is usually not feasible: for most GoF tests, a closed-form expression does not even exist. The estimate is necessary to properly select the sample size for testing procedures, thus reducing type-II errors, i.e. the inability to reject the null hypothesis when it is actually false. The availability of this dataset can be advantageous for several fields, where the sample size is often selected with empirical procedures and where the results are often interpreted in an overly optimistic way [1]. GoF tests aim at identifying the deviation of data samples from a given distribution. However, if the test is not able to identify such a null hypothesis violation, nothing can be stated, and the statistical power becomes the only quantitative measure of the reliability of the test result. GoF tests have not been studied in the Case 0 scenario (also called external validation) for EVT distributions, i.e. when the samples used to perform the test are a different set from the samples used to estimate the reference distribution. In particular, to the best of our knowledge, quantitative information is available in the literature only for Case 3 scenarios [2], while no Case 0 power analysis is available for these distribution classes. This dataset aims to fill exactly this gap.
The statistical power computation has been performed with Monte Carlo approximations on a very large number of samples (10^9), which guarantees a high level of accuracy of the results. This, together with the external validation scenario, is an interesting feature for recent applications of the EVT. One possible use case of this dataset is probabilistic real-time computing [3], where EVT is used to estimate the probabilistic Worst-Case Execution Time (WCET) of computer tasks. In this scenario, the confidence level of the statistical test is critical: a false-negative result may lead to an underestimation of the WCET, which may be unacceptable for the production system [4]. This is why we built the statistical power dataset with the highest possible accuracy, enabling the selection of a suitable sample size and ensuring sufficient reliability of the test results [5].

Hypothesis testing and statistical power
In hypothesis testing, the null hypothesis (H0) is rejected when the observed data strongly suggest that it is false, in favour of an alternative hypothesis (H1). On the contrary, if the null hypothesis cannot be rejected, nothing can be inferred about the truthfulness of either hypothesis. The statistical power is defined as the probability of correctly rejecting the null hypothesis when it is actually false, i.e. the complement of the probability of a Type II error (the failure to reject a false null hypothesis). This concept can be expressed with the following conditional probability: power = P(reject H0 | H0 is false) = 1 - beta, where beta = P(not reject H0 | H0 is false). This work presents the estimated statistical power of three Goodness-of-Fit (GoF) tests for EVT distributions: Kolmogorov-Smirnov (KS) [6], Anderson-Darling (AD) [7], and Modified Anderson-Darling (MAD) [8]. Other common tests, such as the Chi-Squared (CS) and Cramér-von Mises (CvM) tests, have been excluded because state-of-the-art works already showed that they have lower statistical power than KS or AD [9,10].
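The definitions above can be made concrete with a minimal Monte Carlo sketch (assuming scipy and numpy are available; the sampler, sample sizes, and trial counts below are illustrative, not the paper's configuration). Samples are drawn from a distribution for which H0 is false, and the rejection rate of the test approximates P(reject H0 | H0 is false):

```python
# Minimal sketch: estimating statistical power by Monte Carlo.
# H0 claims the data come from N(0, 1); we actually draw from N(0.5, 1),
# so the fraction of rejections approximates the power, 1 - beta.
import numpy as np
from scipy import stats

def ks_power(alt_sampler, cdf, n, alpha=0.05, trials=2000, seed=0):
    """Fraction of trials in which the KS test rejects H0 at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(trials):
        sample = alt_sampler(rng, n)
        _, p_value = stats.kstest(sample, cdf)
        if p_value < alpha:
            rejections += 1
    return rejections / trials

# H0 is false here: data come from N(0.5, 1), not the reference N(0, 1).
power = ks_power(lambda rng, n: rng.normal(0.5, 1.0, n), "norm", n=50)
```

With a location shift of half a standard deviation and n = 50, the estimated rejection rate is well above the significance level, which is exactly what distinguishes an informative test from an underpowered one.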
Regarding the specific EVT case, the work of Heo et al. [2] estimated the AD and MAD test critical values and power by using a Monte Carlo approach for GoF tests of EVT distributions. The critical values were computed for a scenario where the model parameters to be tested were estimated from the same data used for the test. This scenario is commonly referred to as Case 3, i.e. the assumed distribution parameters are unknown. In fact, a priori knowledge of the distribution parameters (Case 0) is not usually available in most classical EVT applications. However, in some cases, e.g. the probabilistic real-time computing previously mentioned, we can easily increase the sample size, because obtaining new samples requires little effort. For this reason, Case 0 can be applied by drawing different independent samples for model parameter estimation and for model validation. This enables external validation, which leads, in general, to the most stringent and unbiased test [11].
Generally, statistical power estimations for Case 0 are not representative of Case 3 and vice versa. This makes the data provided with this paper extremely valuable, because they represent a highly accurate estimation of the GoF statistical power for the external validation scenario and EVT distributions.
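The Case 0 protocol described above can be sketched as follows (a simplified illustration assuming scipy; the Gumbel reference, the sample sizes, and the use of the KS statistic are choices made here for brevity, not the paper's exact setup). The key point is that the parameters are estimated on one sample and the test runs on a fresh, independent one:

```python
# Sketch of Case 0 (external validation): fit on one sample, test another.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

fit_sample = rng.gumbel(loc=0.0, scale=1.0, size=500)   # estimation set
test_sample = rng.gumbel(loc=0.0, scale=1.0, size=200)  # validation set

# Case 0: parameters come from fit_sample only; test_sample is independent.
loc_hat, scale_hat = stats.gumbel_r.fit(fit_sample)
stat_case0, p_case0 = stats.kstest(test_sample, "gumbel_r",
                                   args=(loc_hat, scale_hat))

# Case 3, for contrast: fit and test on the same data. The nominal
# critical values of the test are then biased, which is why Case 3
# requires the dedicated tables of Heo et al. [2].
loc3, scale3 = stats.gumbel_r.fit(test_sample)
stat_case3, p_case3 = stats.kstest(test_sample, "gumbel_r",
                                   args=(loc3, scale3))
```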

Statistical power estimation
The EVT distributions can be grouped under the Generalized Extreme Value distribution GEV(mu, sigma, xi), where mu is the location parameter, sigma is the scale parameter, and xi is the shape parameter. The location and scale parameters determine the linear transformation of the standard GEV, while the shape parameter determines the distribution class. In this work, we explored all three GEV classes as reference distributions: a Gumbel distribution GEV(0, 1, 0), a Weibull distribution GEV(0, 1, -0.5), and a Fréchet distribution GEV(0, 1, 0.5). For each of these distributions, the Goodness-of-Fit tests have been run on samples drawn from the other two GEV distributions and from: a normal N(0, 1), a Student's t(10), and a uniform distribution U(-2, 3). The results for KS are shown in Table 1, for AD in Table 2, and for MAD in Table 3.
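The six distributions listed above can be instantiated as follows (a sketch assuming scipy; the helper name `gev` is ours). One caveat worth a comment: scipy's `genextreme` uses the opposite sign convention for the shape, so its parameter `c` equals -xi:

```python
# The six sampling distributions used in the power study.
# scipy.stats.genextreme parameterizes the shape as c = -xi, so
# GEV(mu, sigma, xi) in the paper's convention maps to genextreme(c=-xi).
import numpy as np
from scipy import stats

def gev(xi, loc=0.0, scale=1.0):
    """GEV(loc, scale, xi) in the paper's sign convention."""
    return stats.genextreme(c=-xi, loc=loc, scale=scale)

distributions = {
    "Gumbel GEV(0, 1, 0)":     gev(0.0),
    "Weibull GEV(0, 1, -0.5)": gev(-0.5),   # upper tail bounded at 2
    "Frechet GEV(0, 1, 0.5)":  gev(0.5),    # lower tail bounded at -2
    "Normal N(0, 1)":          stats.norm(0.0, 1.0),
    "Student t(10)":           stats.t(df=10),
    "Uniform U(-2, 3)":        stats.uniform(loc=-2.0, scale=5.0),
}

rng = np.random.default_rng(0)
samples = {name: d.rvs(size=100, random_state=rng)
           for name, d in distributions.items()}
```

The support bounds noted in the comments follow from mu - sigma/xi: the Weibull class (xi < 0) has a finite upper endpoint, the Fréchet class (xi > 0) a finite lower endpoint, which is what makes the three classes distinguishable by a GoF test in the first place.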

Sensitivity analysis
Given the statistical power results of the representative test cases, we performed a sensitivity analysis on the sample size and the shape parameter xi of the GEV distribution. The results are depicted in Fig. 1, while the raw data are available in the dataset.

Experimental design, materials, and methods
The analytical computation of the statistical power, and consequently the selection of an appropriate sample size, is usually not possible, due to the frequent lack of knowledge of the effect size, i.e. the real characterization of the population's distribution from which the samples have been collected. Consequently, Muthén et al. [12] studied the usage of Monte Carlo methods to select the sample size and determine the testing power. To this purpose, we need to define a set of tuples representing the test conditions. In particular, the Monte Carlo sampling is executed for every tuple (D, n, alpha, G1, G2), where D is the statistic of the test under analysis, n is the sample size, alpha is the level of significance, and G1, G2 are, respectively, the reference distribution with cumulative distribution function F(x) and the empirical distribution with cumulative distribution function Fn(x).
The statistic D for the KS, AD, and MAD tests can be computed using their discretized forms [13-15]; for example, the discretized AD statistic over the ordered sample x_1 <= ... <= x_n is

A^2 = -n - (1/n) * sum_{i=1..n} [ (2i - 1) log(F(x_i)) + (2n - 2i + 1) log(1 - F(x_i)) ].

The estimation algorithm is shown in Listing 2. For each scenario, the critical value is computed (line 2) and a large number of explorations N is performed (lines 3-10). Each time, we draw a sample from the reference distribution (line 4) and we check whether the statistic D of the ecdf exceeds the critical value (line 5). If the statistic value is higher than the critical value, the sample is rejected (line 6), otherwise it is not (line 8). Finally, the ratio of rejections over total samples gives us the statistical power (line 11). If the test is able to detect the differences between G1 and G2, we expect this ratio to be close to 1. In this specific Monte Carlo simulation, the standard error of the power can be computed as [17]

SE = sqrt( (R_N / N) * (1 - R_N / N) / N ),

where R_N is the number of rejections (the accumulation variable of line 12). The standard error decreases when N tends to infinity and when R_N tends to N, i.e. when the statistical power approaches its maximum value of 1.

Fig. 1. Sensitivity plots for G0 ~ GEV(0, 1, 0), G1 ~ GEV(0, 1, 0.5).
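Since Listing 2 itself is not reproduced in this text, the procedure it describes can be sketched as follows (an illustration assuming scipy/numpy; all names, the choice of the KS statistic, and the trial counts are ours, and the critical value is itself estimated by Monte Carlo under H0 rather than taken from tables):

```python
# Sketch of the power-estimation loop described above: compute the critical
# value for (D, n, alpha), then count rejections over N draws from G2.
import numpy as np
from scipy import stats

def ks_statistic(sample, cdf):
    """Discretized KS statistic D_n of a sample against reference CDF F."""
    x = np.sort(sample)
    n = len(x)
    i = np.arange(1, n + 1)
    f = cdf(x)
    return max(np.max(i / n - f), np.max(f - (i - 1) / n))

def estimate_power(g1, g2, n, alpha, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Critical value: (1 - alpha) quantile of D under H0, i.e. samples from G1.
    null_stats = [ks_statistic(g1.rvs(size=n, random_state=rng), g1.cdf)
                  for _ in range(n_trials)]
    crit = np.quantile(null_stats, 1.0 - alpha)
    # Rejection count for samples drawn from the alternative G2.
    rejections = sum(
        ks_statistic(g2.rvs(size=n, random_state=rng), g1.cdf) > crit
        for _ in range(n_trials))
    power = rejections / n_trials
    # Monte Carlo standard error of the power estimate.
    se = np.sqrt(power * (1.0 - power) / n_trials)
    return power, se

g1 = stats.genextreme(c=0.0)    # Gumbel, GEV(0, 1, 0)
g2 = stats.genextreme(c=-0.5)   # Frechet, GEV(0, 1, 0.5); scipy's c = -xi
power, se = estimate_power(g1, g2, n=100, alpha=0.05)
```

At N = 2000 trials the standard error is bounded by sqrt(0.25 / 2000), about 0.011; the paper's 10^9 draws shrink this bound by roughly three further orders of magnitude, which is what "highest possible accuracy" amounts to quantitatively.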
The selected values for the parameters of each Monte Carlo estimation are: alpha: the significance level. We studied the traditional values of 0.05 and 0.01.
The simulations ran on 4 nodes of the CINECA supercomputing facility (GALILEO-A1 cluster, 2× Intel Xeon E5-2697 v4 @ 2.3 GHz per node), for a total of 144 CPU cores. They took approximately 13 h for the KS tests, 17.5 h for the AD tests, and 16 h for the MAD tests.
Given the statistical power results of the representative test cases, we performed a sensitivity analysis on the sample size n and the shape parameter xi. The power was obtained using the same procedure as Listing 2, but with a considerably reduced number of iterations N, in order to enable a fine-grained analysis with a sustainable computational effort. By exploring the integer sample size space and the real shape parameter space, the Monte Carlo simulations produce a power matrix of size |xi| x |n| (where |.| is the cardinality of the set of all possible values of the corresponding parameter).
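The grid exploration above can be sketched as follows (illustrative only: the grid values and the low trial count are ours, chosen to mirror the reduced-N sensitivity runs rather than the full 10^9-sample estimation). Each cell of the resulting matrix is one crude power estimate:

```python
# Sketch of the sensitivity grid: re-estimate the power over a grid of
# sample sizes n and shape parameters xi, yielding a |xi| x |n| matrix.
import numpy as np
from scipy import stats

def reject_rate(g1, g2, n, alpha=0.05, n_trials=200, seed=0):
    """Crude KS power estimate (few iterations, as in the sensitivity runs)."""
    rng = np.random.default_rng(seed)
    null = [stats.kstest(g1.rvs(size=n, random_state=rng), g1.cdf).statistic
            for _ in range(n_trials)]
    crit = np.quantile(null, 1.0 - alpha)
    hits = sum(
        stats.kstest(g2.rvs(size=n, random_state=rng), g1.cdf).statistic > crit
        for _ in range(n_trials))
    return hits / n_trials

sample_sizes = [25, 50, 100]        # integer sample size grid (illustrative)
shapes = [0.1, 0.3, 0.5]            # xi grid for the alternative GEV
g1 = stats.genextreme(c=0.0)        # reference Gumbel, GEV(0, 1, 0)

power_matrix = np.array([
    [reject_rate(g1, stats.genextreme(c=-xi), n) for n in sample_sizes]
    for xi in shapes                # scipy's c = -xi sign convention
])
```

Each row corresponds to one xi value and each column to one sample size, matching the |xi| x |n| layout of the published raw data.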