Comparative Performance of Four Single Extreme Outlier Discordancy Tests from Monte Carlo Simulations

Using highly precise and accurate Monte Carlo simulations of 20,000,000 replications and 102 independent simulation experiments with extremely low simulation errors and total uncertainties, we evaluated the performance of four single outlier discordancy tests (Grubbs test N2, Dixon test N8, skewness test N14, and kurtosis test N15) for normal samples of sizes 5 to 20. Statistical contaminations of a single observation resulting from parameters called δ from ±0.1 up to ±20 for modeling the slippage of central tendency or ε from ±1.1 up to ±200 for slippage of dispersion, as well as no contamination (δ = 0 and ε = ±1), were simulated. Because of the use of precise and accurate random and normally distributed simulated data, very large replications, and a large number of independent experiments, this paper presents a novel approach for precise and accurate estimations of power functions of four popular discordancy tests and, therefore, should not be considered as a simple simulation exercise unrelated to probability and statistics. From both criteria of the Power of Test proposed by Hayes and Kinsella and the Test Performance Criterion of Barnett and Lewis, Dixon test N8 performs less well than the other three tests. The overall performance of these four tests could be summarized as N2≅N15 > N14 > N8.


Introduction
As summarized by Barnett and Lewis [1], a large number of discordancy tests are available for determining whether an outlier is an extreme (i.e., legitimate) or a discordant (i.e., contaminant) observation in normal samples at a given confidence or significance level. These discordancy tests are likely to be characterized by different power or performance. Numerous researchers [1][2][3][4][5][6] have commented on the properties of these tests under the slippage of location or central tendency and the slippage of scale or dispersion by one or more observations, but very few studies have been reported on the use of Monte Carlo simulation for precise and accurate performance measures of these tests. More recently, using Monte Carlo simulation of 100,000 replications or runs, Hayes and Kinsella [7] evaluated the performance criteria of two discordancy tests (Grubbs single outlier test N2 and Grubbs multiple outlier test N4k2; the nomenclature is after Barnett and Lewis [1]) and discussed their spurious and nonspurious components of type II error and power function. However, four single extreme outlier type discordancy tests, also called two-sided discordancy tests by Barnett and Lewis [1], are available: Grubbs type N2, Dixon type N8, skewness N14, and kurtosis N15. Their relative performance measures should be useful for choosing among the different tests for specific applications.
The Scientific World Journal
Monte Carlo simulation methods have been extensively used in numerous simulation studies [8][9][10][11][12][13][14][15][16][17][18]. Some of the relatively recent papers are by Efstathiou [12], Gottardo et al. [13], Khedhiri and Montasser [14], P. A. Patel and J. S. Patel [15], Noughabi and Arghami [16], Krishnamoorthy and Lian [17], and Verma [18].
For example, Noughabi and Arghami [16] compared seven normality tests (Kolmogorov-Smirnov, Anderson-Darling, Kuiper, Jarque-Bera, Cramer von Mises, Shapiro-Wilk, and Vasicek) for sample sizes of 10, 20, 30, and 50 and under different circumstances recommended the use of Jarque-Bera, Anderson-Darling, Shapiro-Wilk, and Vasicek tests.
We used Monte Carlo simulations to evaluate comparative efficiency of four extreme outlier type discordancy tests (N2, N8, N14, and N15, the nomenclature after Barnett and Lewis [1]) for sample sizes of 5 to 20. Our approach to the statistical problem of test performance is novel because, instead of using commercial or freely available software, we programmed and generated extremely precise and accurate random numbers and normally distributed data, used very large replications of 20,000,000, performed 102 independent experiments, and reduced the simulation errors to such an extent that the differences in test performance are far greater than the total uncertainties expressed as 99% confidence intervals of the mean. This is an approach hitherto practiced by none (see, e.g., [8][9][10][11][12][13][14][15][16][17][18]) except by our group [19][20][21][22][23]. This work, therefore, supersedes the approximate simulation results of test performance reported by the statisticians Hayes and Kinsella [7].

Discordancy Tests
For a data array $x_1, x_2, x_3, \ldots, x_{n-2}, x_{n-1}, x_n$ or an ordered array $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n-2)}, x_{(n-1)}, x_{(n)}$ of n observations, with mean $\bar{x}$ and standard deviation s, four test statistics were objectively evaluated in this work. For a statistically contaminated sample of size n from 5 to 20, n − 1 observations of this data array were obtained from a normal distribution N(0, 1) and the remaining observation was taken from a central tendency shifted distribution N(0 + δ, 1) or a dispersion shifted distribution N(0, 1 × ε), where the contaminant parameters δ for modeling slippage of central tendency and ε for slippage of dispersion can be either positive or negative. For an uncontaminated sample, the simulations were done for δ = 0 and ε = ±1. In order to achieve an unbiased comparison, the application of the tests was always forced to the upper outlier $x_{(n)}$ for positive values of δ or ε and to the lower outlier $x_{(1)}$ for negative values of δ or ε.
The first two tests were the Grubbs type test N2 and the Dixon type test N8. The third test was the sample skewness N14 (note that the absolute value is used for evaluation). Finally, the fourth test was the sample kurtosis N15. All tests were applied at a strict 99% confidence level using the new precise and accurate critical values (CV99) simulated using a Monte Carlo procedure by Verma et al. [19] for N2, N8, and N15 and by Verma and Quiroz-Ruiz [20] for N14, which permitted an objective comparison of their performance.
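For reference, the four statistics can be written in the forms commonly given for tests N2, N8, N14, and N15 in the nomenclature of Barnett and Lewis [1]. The expressions below are a sketch under that assumption rather than a reproduction of the paper's own numbered equations. Here s is taken as the sample standard deviation with divisor n − 1, and the sums run over i = 1, ..., n. The "upper" form applies when $x_{(n)}$ is tested and the "lower" form when $x_{(1)}$ is tested.

```latex
\begin{align}
TN2 &= \frac{x_{(n)} - \bar{x}}{s} \;\text{(upper)}
      \quad\text{or}\quad
      \frac{\bar{x} - x_{(1)}}{s} \;\text{(lower)} \\
TN8 &= \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}} \;\text{(upper)}
      \quad\text{or}\quad
      \frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}} \;\text{(lower)} \\
TN14 &= \left| \frac{\sqrt{n}\,\sum_{i} (x_i - \bar{x})^{3}}
                     {\left[\sum_{i} (x_i - \bar{x})^{2}\right]^{3/2}} \right| \\
TN15 &= \frac{n \sum_{i} (x_i - \bar{x})^{4}}
             {\left[\sum_{i} (x_i - \bar{x})^{2}\right]^{2}}
\end{align}
```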

Monte Carlo Simulations
Random numbers U(0, 1) uniformly distributed in the interval (0, 1) and normal random variates N(0, 1) were generated using the method summarized by Verma and Quiroz-Ruiz [21]. However, instead of only 10 series or streams of N(0, 1) as done by Verma and Quiroz-Ruiz [21], a total of 102 different streams of N(0, 1) were simulated. Similarly, the replications were far more numerous than those used by Verma and Quiroz-Ruiz [21] for generating precise and accurate critical values.
Now, if we were to arrange the complete array from the lowest to the highest observations, the ordered array could be called $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n-2)}, x_{(n-1)}, x_{(n)}$ after Barnett and Lewis [1]. All four tests under evaluation could then be applied to the resulting data array.
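Applying the four tests to an ordered array can be illustrated with a short sketch. It assumes the commonly used forms of N2, N8, N14, and N15 after Barnett and Lewis [1], with the sample standard deviation computed with divisor n − 1. The function name `discordancy_statistics` and the `upper` flag are hypothetical and not part of the original work.

```python
import math


def discordancy_statistics(sample, upper=True):
    """Return TN2, TN8, TN14, and TN15 for one sample.

    upper=True tests the highest observation x(n);
    upper=False tests the lowest observation x(1).
    """
    x = sorted(sample)  # ordered array x(1) ... x(n)
    n = len(x)
    mean = sum(x) / n
    ss2 = sum((v - mean) ** 2 for v in x)  # sum of squared deviations
    ss3 = sum((v - mean) ** 3 for v in x)
    ss4 = sum((v - mean) ** 4 for v in x)
    s = math.sqrt(ss2 / (n - 1))  # sample standard deviation
    spread = x[-1] - x[0]
    if upper:
        tn2 = (x[-1] - mean) / s
        tn8 = (x[-1] - x[-2]) / spread
    else:
        tn2 = (mean - x[0]) / s
        tn8 = (x[1] - x[0]) / spread
    tn14 = abs(math.sqrt(n) * ss3 / ss2 ** 1.5)  # absolute sample skewness
    tn15 = n * ss4 / ss2 ** 2  # sample kurtosis
    return {"N2": tn2, "N8": tn8, "N14": tn14, "N15": tn15}


print(discordancy_statistics([1.0, 2.0, 3.0, 4.0, 10.0], upper=True))
```

Each returned value would then be compared with the corresponding critical value CV99 for the sample size in question; those critical values are not reproduced here.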
For a contaminated sample in which the extreme observation under test was a legitimate one (the contaminant occupying an inner position of the ordered array), when any of these four tests (N2, N8, N14, or N15) was applied, the outcome was counted either towards the spurious type II error probability if the test was not significant or towards the spurious power if it was significant (Table 1). For this decision, the calculated test statistic TN (TN2, TN8, TN14, or TN15) for a sample was compared with the respective CV99 [19,20]. If TN ≤ CV99, the outcome of the test was considered as not significant; otherwise, when TN > CV99, the outcome of the test was considered as significant (Table 1).
Similarly, for a contaminated sample in which the contaminant itself was the extreme observation under test, when a discordancy test was applied, the outcome was counted either towards the nonspurious type II error probability if the test was not significant or towards the nonspurious power if the test was significant (Table 1). If δ = 0 or ε = ±1 (the contaminant absent) and a discordancy test was applied to the ordered array $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n-2)}, x_{(n-1)}, x_{(n)}$ to evaluate the extreme observation $x_{(n)}$ or $x_{(1)}$, the outcome would be either a true negative if the test was not significant, that is, if it failed to detect $x_{(n)}$ or $x_{(1)}$ as discordant, or a type I error if the test was significant, that is, if it succeeded in detecting $x_{(n)}$ or $x_{(1)}$ as discordant (Table 1).
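The decision scheme of Table 1, as described in the text, can be summarized in a small sketch. The function names and string labels are hypothetical. The only rule taken from the text is that a test is significant when TN > CV99, together with the six outcome categories.

```python
from collections import Counter


def is_significant(tn, cv99):
    """A test is significant only when the statistic exceeds CV99."""
    return tn > cv99


def classify_outcome(contaminated, contaminant_is_extreme, significant):
    """Map one replication to one of the six outcome categories."""
    if not contaminated:
        return "type I error" if significant else "true negative"
    if contaminant_is_extreme:
        # The tested extreme observation is the contaminant itself.
        return "nonspurious power" if significant else "nonspurious type II error"
    # The tested extreme observation is legitimate; the contaminant is inner.
    return "spurious power" if significant else "spurious type II error"


def outcome_probabilities(outcomes):
    """Estimate each probability as a relative frequency over replications."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


outcomes = [
    classify_outcome(True, True, True),
    classify_outcome(True, True, False),
    classify_outcome(True, False, False),
    classify_outcome(True, True, True),
]
print(outcome_probabilities(outcomes))
```

In a full simulation, the list of outcomes would hold one entry per replication for a given test, sample size, and contaminant parameter.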

Test Performance Criteria
Hayes and Kinsella [7] documented that a good discordancy test would be characterized by a high nonspurious power probability, a low spurious power probability, and a low nonspurious type II error probability. Hayes and Kinsella [7] defined the Power of Test (Ω) as given in (5). Similarly, they also defined the Test Performance Criterion, or Conditional Power, as given in (6); it is equivalent to the probability P5 of Barnett and Lewis [1].
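A possible explicit form of the two criteria is sketched below. The symbols are assumptions introduced here for illustration: $\pi_{DC}$ denotes the nonspurious power, $\pi_{DE}$ the spurious power, and $\pi_{\bar{D}C}$ the nonspurious type II error probability. The sketch takes Ω to be the sum of the nonspurious and spurious power components, and the Conditional Power to be the probability of a significant test given that the contaminant is the extreme observation, in line with its stated equivalence to P5 of Barnett and Lewis [1].

```latex
\begin{align}
\Omega &= \pi_{DC} + \pi_{DE} \\
\pi_{D|C} &= \frac{\pi_{DC}}{\pi_{DC} + \pi_{\bar{D}C}}
\end{align}
```

Under these assumptions, both quantities lie between 0 and 1.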

Optimum Replications
The optimum replications required for minimizing the errors of Monte Carlo simulations were decided from representative results summarized in Figures 1 and 2. Similarly, Ω for all four tests as a function of replications is shown in Figures 2(a)-2(d), which allows a visual comparison of this performance parameter for different sample sizes and contaminant parameter values. Error bars (99% confidence intervals) for the 102 simulation experiments are not shown for simplicity but, for replications larger than 10,000,000, they were within the size of the symbols. With the 20,000,000 replications routinely used for comparing the performance of discordancy tests, the differences among Ω values (Figures 2(a)-2(d)) are statistically significant at a high confidence level; that is, these differences are much greater than the simulation errors.
Alternatively, following Krishnamoorthy and Lian [17], the simulation error for the 20,000,000 replications used routinely in our work can be estimated approximately as $2 \times \sqrt{0.5 \times 0.5 / 20{,}000{,}000} \approx 0.00022$.
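The arithmetic of this estimate can be checked with a brief sketch. The function name `simulation_error` is hypothetical, and p = 0.5 is the worst-case probability used in the expression above.

```python
import math


def simulation_error(replications, p=0.5):
    """Approximate simulation error: 2 * sqrt(p * (1 - p) / replications)."""
    return 2.0 * math.sqrt(p * (1.0 - p) / replications)


print(round(simulation_error(20_000_000), 5))  # 0.00022
```

For comparison, the same expression gives about 0.0032 for 100,000 replications.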
Because we carried out 102 independent simulation experiments, each with 20,000,000 replications, our simulation errors were even smaller than the above value. Thus, the Monte Carlo simulations can be considered highly precise. They can also be said to be highly accurate, because our procedure was adapted from the highly precise and accurate method of Verma and Quiroz-Ruiz [21]. These authors had shown high precision and accuracy of each of their U(0, 1) and N(0, 1) experiments and had also applied all kinds of simulated data quality tests suggested by Law and Kelton [25]. Besides, in the present work a large number of such experiments (204 streams of U(0, 1) and 102 streams of N(0, 1)) have been carried out. Therefore, as an innovation in Monte Carlo simulations, we present the mean values as well as the total uncertainty of 102 independent experiments, expressed as the confidence interval of the mean at the strict 99% confidence level.
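One way to obtain such a mean and its 99% uncertainty from a set of independent experiments is sketched below. It assumes a Student t based confidence interval of the mean, which may differ in detail from the authors' procedure. The function name `mean_and_u99` is hypothetical, and the default `t_crit` is an approximate two-sided 99% Student t quantile for 101 degrees of freedom (102 experiments).

```python
import math
import statistics


def mean_and_u99(values, t_crit=2.625):
    """Return (mean, half-width of the 99% confidence interval of the mean).

    t_crit: assumed two-sided 99% Student t quantile; about 2.625 for
    101 degrees of freedom. Pass a different value for other sample sizes.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = t_crit * sd / math.sqrt(len(values))
    return mean, half_width


# Illustrative input only: three made-up experiment results.
print(mean_and_u99([0.5003, 0.5005, 0.5004], t_crit=9.925))
```

Because the half-width shrinks with the square root of the number of experiments, averaging 102 experiments narrows the interval by roughly a factor of ten relative to the scatter of a single experiment.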
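Finally, in order to evaluate the test performance, test N2 was used as a reference and the percent differences in mean values ($\Delta_N$) of the other three tests were calculated as
$$\Delta_N = \frac{\bar{x}_N - \bar{x}_{N2}}{\bar{x}_{N2}} \times 100, \quad (7)$$
where the subscript N stands for N8, N14, or N15, and $\bar{x}$ denotes the mean value of the performance criterion for the test indicated.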
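Assuming that the difference is a percent difference relative to test N2, as the Results section indicates, the calculation can be sketched as follows. The function name `percent_difference` is hypothetical; the Ω values are those quoted later in the text for n = 5 (Table 3).

```python
def percent_difference(value, reference):
    """Percent difference of a test's mean value relative to the reference."""
    return 100.0 * (value - reference) / reference


omega_n2 = 0.50044  # reference test N2
omega = {"N8": 0.49503, "N14": 0.50014, "N15": 0.50015}

for test, value in omega.items():
    print(test, round(percent_difference(value, omega_n2), 2))
```

The rounded results, about −1.1%, −0.06%, and −0.06%, agree with the differences reported for these three tests.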

Type and Contaminant-Absent Events.
According to Barnett and Lewis [1], events of this type are of no major concern, because the contaminant occupies an inner position in the ordered array and the extreme observation $x_{(n)}$ or $x_{(1)}$ under evaluation by the discordancy tests is a legitimate observation. A contaminant in an inner position affects the sample mean and standard deviation much less [1].
The probability values show a similar behavior for larger values of ε than for δ (compare Figures 7 and 8 with Figures 5 and 6, respectively). There are some differences in these probability values among the different tests (Table 2; Figures 5-8), but they are better discussed in terms of the test performance criteria.

Test Performance Criteria (Ω and Conditional Power).
These two parameters are plotted as a function of δ and ε in Figures 9, 10, 11, and 12, and the most important results are summarized in Tables 3-6. For a good test, both Ω (5) and the Conditional Power (6) should be large [1,7]. Values of both performance criteria increase as δ or ε values depart from the uncontaminated values of δ = 0 or ε = ±1 (Figures 9-12; Tables 3-6). However, both criteria increase less rapidly for smaller n than for larger n. For n = 5, even for δ = ±20 or ε = ±200, neither of the two parameters truly reaches the maximum theoretical value of 0.99 (Figure 9(a) to Figure 12(a)).
The performance differences of the four tests are now briefly discussed. The total uncertainty values of the simulations (99% confidence intervals of the mean) are extremely small (the error is at the fifth or even sixth decimal place; Tables 3-6). Therefore, most differences among the tests ($\Delta_{N8}$ for test N8, $\Delta_{N14}$ for test N14, and $\Delta_{N15}$ for test N15; all percent differences are with respect to test N2; see (7)) are statistically significant (Tables 3-6). A negative value of $\Delta_N$ (where N stands for N8, N14, or N15) means that the Ω or Conditional Power value for a given test (N8, N14, or N15) is less than that of test N2, implying a worse performance of the given test as compared to test N2, whereas a positive value of $\Delta_N$ signifies just the opposite. Note that test N2 is chosen as the reference test because it generally shows the best performance (Tables 3-6). Additional fine-scale simulations were also carried out for the δ and ε values at which both Ω and the Conditional Power become about 0.5 for the reference test N2 (0.5 is about half of the maximum value of one for either criterion). Hence, the values of Ω and the Conditional Power can be visually compared in Tables 3-6 (see the rows in italic font). For n = 5, all tests show rather similar performance, because the maximum difference ($\Delta_N$) is only about −1.1% for N8 (as compared to N2) and less than 0.1% in magnitude for N14 and N15 (see the first set of rows corresponding to n = 5 in Tables 3-6). Test N2 shows Ω = 0.50044 for δ = ±10.17, whereas tests N8, N14, and N15 have Ω values of 0.49503, 0.50014, and 0.50015, respectively (Table 3). The respective $\Delta_N$ values are about −1.1%, −0.06%, and −0.06% (Table 3). Practically the same results are valid for the Conditional Power as well (see the row in italic font in Table 4). Similar results were documented for Ω and the Conditional Power as a function of ε (rows for ε = ±12.9 or ±13.1 in Tables 5 and 6, respectively).
The significantly lower Ω and Conditional Power values of the Dixon test N8 as compared to the Grubbs test N2, skewness test N14, and kurtosis test N15 may be related to the masking effect of the penultimate observation $x_{(n-1)}$ on $x_{(n)}$ or of $x_{(2)}$ on $x_{(1)}$, as documented by Barnett and Lewis [1]. The masking effect may also be responsible for the somewhat worse performance of N14 as compared to N2.
6.4. Final Remarks. The two performance criteria (Ω and the Conditional Power) [1,7] used in this work provide similar estimates (Tables 3-6) and, more importantly, similar conclusions. Therefore, either of them can be used to evaluate numerous other discordancy tests for single or multiple outliers [1,26,27,28]. The main result of the Monte Carlo simulations concerning the performance of the single extreme outlier discordancy tests could be stated as follows: N2 ≅ N15 > N14 > N8.
Additional simulation work is required to evaluate other discordancy tests, such as the single upper or lower outlier tests, as well as more complex statistical contamination involving two or more discordant outliers and the comparison of consecutive application of single outlier discordancy tests with multiple outlier tests [1,7,26,27,28]. Then, the multiple test method, initially proposed by Verma [29] and used by many researchers [30][31][32][33][34][35], would be substantially improved for subsequent applications. These performance results could then be incorporated in new versions of the computer programs DODESSYS [36], TecD [37], and UDASYS [38].

Conclusions
Our simulation study clearly shows that Dixon test N8 performs less well than the other three extreme single outlier tests (Grubbs N2, skewness N14, and kurtosis N15). Both performance parameters (the Power of Test Ω and the Test Performance Criterion, or Conditional Power) are up to about 16% lower for N8 than for test N2. Test N8, therefore, shows the worst performance for outlier detection. For certain values of δ or ε, test N14 also shows lower values of both parameters than N2, which means that N14 is also somewhat worse than N2. The other two tests (N2 and N15) could be considered comparable in their performance.