Entropy Estimation From Ranked Set Samples With Application to Test of Fit

This article deals with entropy estimation using ranked set sampling (RSS). Some estimators are developed based on the empirical distribution function and its nonparametric maximum likelihood competitor. The suggested entropy estimators have smaller root mean squared errors than the other entropy estimators in the literature. The proposed estimators are then used to construct goodness of fit tests for


Introduction
In situations where exact measurements of sample units are expensive or difficult to obtain, but ranking them (in small sets) is cheap or easy, the ranked set sampling (RSS) scheme is an appropriate alternative to simple random sampling (SRS). It often leads to improved statistical inference as compared with SRS. This sampling strategy was proposed by McIntyre (1952) for estimating the mean of pasture yield. He noticed that while obtaining the exact value of the yield of a plot is difficult and time-consuming, one can simply rank adjacent plots in terms of their pasture yield by eye inspection. The RSS scheme can be described as follows:
1. Draw a simple random sample of size $k^2$ from the population of interest, and then partition it into $k$ samples of size $k$.
2. Rank each sample of size $k$ in increasing magnitude of the variable of interest without obtaining precise values of the sample units. The ranking process in this step can be based on personal judgement, eye inspection or a concomitant variable, and need not be accurate.
3. Obtain the exact measurement of the unit with rank $r$ in the $r$th sample (for $r = 1, \ldots, k$).
Let $\{X_{[r]s} : r = 1, \ldots, k;\ s = 1, \ldots, n\}$ be a ranked set sample of size $N = nk$, where $X_{[r]s}$ is the $r$th judgement ordered unit from the $s$th cycle. The term judgement order implies that the ranking process in step 2 above is done without referring to precise values of the sample units, and therefore it may be inaccurate (imperfect). Thus, the $r$th judgement ordered unit and the true $r$th ordered unit may be different. Note that all sample units in the RSS scheme are independent, but not identically distributed. For $r = 1, \ldots, k$, the units $X_{[r]1}, \ldots, X_{[r]n}$ are independent and identically distributed, and they follow the distribution of the $r$th judgement order statistic in a sample of size $k$. In the sequel, the subscript $[\cdot]$ in $X_{[r]s}$ is used to indicate that the ranking process may not be perfect. In the case of perfect ranking (i.e. the ranking process in step 2 above is accurate), we replace $[\cdot]$ by $(\cdot)$ in the subscript of $X_{[r]s}$.
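The three steps above can be sketched in a few lines of Python. This is only an illustration under the perfect ranking assumption: the function name `ranked_set_sample` and the callback `draw` are hypothetical, and the ranking is imitated by sorting the true values, whereas in practice the units are ranked without being measured.

```python
import random

def ranked_set_sample(draw, k, n, rng):
    """Collect a ranked set sample of size n*k under perfect ranking.

    draw(rng) returns one unit from the population; the result is a list of
    (r, value) pairs, where value is the r-th smallest unit of its own set.
    """
    sample = []
    for _ in range(n):                      # n cycles
        for r in range(1, k + 1):           # one fresh set of size k per rank r
            ranked_set = sorted(draw(rng) for _ in range(k))
            sample.append((r, ranked_set[r - 1]))   # measure only the rank-r unit
    return sample

rng = random.Random(2017)                   # fixed seed, for reproducibility only
rss = ranked_set_sample(lambda g: g.random(), k=3, n=4, rng=rng)
print(len(rss))                             # N = n * k measured units
```

Each cycle uses $k^2$ units for ranking but measures only $k$ of them, which is where the cost saving of the design comes from.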
It should be noted that the application of the RSS scheme is not limited to agricultural problems. It can be applied in any situation where ranking observations is much easier than measuring them. Some other potential applications of the RSS scheme are in forestry (Halls & Dell 1966), medicine (Chen, Stasny & Wolfe 2005), environmental monitoring (Nussbaum & Sinha 1997, Kvam 2003, Ozturk, Bilgin & Wolfe 2005) and entomology (Howard, Jones, Mauldin & Beal 1982).
Revista Colombiana de Estadística 40 (2017) 223-241
The RSS estimator of the population mean is given by
\[ \bar{X}_{RSS} = \frac{1}{nk} \sum_{r=1}^{k} \sum_{s=1}^{n} X_{[r]s}. \]
There has been a lot of research on the RSS scheme since its introduction. Takahasi & Wakimoto (1968) proved that $\bar{X}_{RSS}$ is an unbiased estimator of the population mean and has smaller variance than $\bar{X}_{SRS}$, the sample mean in the SRS scheme. The problem of variance estimation in the RSS scheme has been considered by Stokes (1980), MacEachern, Ozturk, Wolfe & Stark (2002), Perron & Sinha (2004) and Zamanzade & Vock (2015). The empirical distribution function (EDF) in the RSS scheme is given by
\[ \hat{F}_{em}(t) = \frac{1}{nk} \sum_{r=1}^{k} \sum_{s=1}^{n} I\left(X_{[r]s} \le t\right). \tag{1} \]
Stokes & Sager (1988) proved that this estimator is unbiased and has smaller variance than the EDF in the SRS scheme for a fixed total sample size ($N$), regardless of ranking errors. It can be seen that as $n \to \infty$,
\[ \sqrt{N}\left(\hat{F}_{em}(t) - F(t)\right) \xrightarrow{d} N\left(0, \sigma_{em}^{2}\right), \]
where
\[ \sigma_{em}^{2} = \frac{1}{k} \sum_{r=1}^{k} F_{[r]}(t)\left(1 - F_{[r]}(t)\right), \]
and $F_{[r]}$ is the cumulative distribution function (CDF) of the $r$th judgement order statistic in a sample of size $k$.
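As a small illustration, both the RSS mean and the RSS EDF pool all $N = nk$ measured units with equal weight. The sketch below uses hypothetical function names and made-up data values.

```python
def rss_mean(values):
    """Average of all measured units X_[r]s, pooled over ranks and cycles."""
    return sum(values) / len(values)

def rss_edf(values, t):
    """Proportion of measured units that are less than or equal to t."""
    return sum(1 for x in values if x <= t) / len(values)

# Hypothetical measurements from an RSS with k = 2 and n = 2 (N = 4).
values = [0.2, 0.9, 0.4, 0.7]
print(rss_mean(values))
print(rss_edf(values, 0.5))
```

The formulas coincide with their SRS counterparts; the efficiency gain comes from the way the units are selected, not from the form of the estimator.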
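Let $\{X_{(r)s} : r = 1, \ldots, k;\ s = 1, \ldots, n\}$ be a ranked set sample of size $N = nk$ collected under the perfect ranking assumption. Accordingly, $X_{(r)s}$ follows the distribution of the $r$th true order statistic in a sample of size $k$. For $r = 1, \ldots, k$, $Y_r = \sum_{s=1}^{n} I\left(X_{(r)s} \le t\right)$ has a binomial distribution with number of trials $n$ and success probability $B_{r,k+1-r}(F(t))$, where
\[ B_{r,k+1-r}(F(t)) = \int_{0}^{F(t)} \frac{k!}{(r-1)!\,(k-r)!}\, u^{r-1} (1-u)^{k-r}\, du \]
is the CDF of the beta distribution with parameters $r$ and $k + 1 - r$ computed at the point $F(t)$. Thus, the log-likelihood function of $(Y_1, \ldots, Y_k)$ can be written, up to an additive constant that does not depend on $F(t)$, as
\[ L(F(t)) = \sum_{r=1}^{k} \left[ Y_r \log B_{r,k+1-r}(F(t)) + (n - Y_r) \log\left(1 - B_{r,k+1-r}(F(t))\right) \right]. \]
It can be shown that $L(F(t))$ is strictly concave in $F(t)$. Therefore, the maximum likelihood estimator of the CDF is defined as
\[ \hat{F}_{L}(t) = \arg\max_{F(t) \in [0,1]} L(F(t)). \tag{2} \]
This estimator was introduced by Kvam & Samaniego (1994), and its asymptotic behavior was studied by Huang (1997) and Duembgen & Zamanzade (2013). As $n \to \infty$, we have
\[ \sqrt{N}\left(\hat{F}_{L}(t) - F(t)\right) \xrightarrow{d} N\left(0, \sigma_{L}^{2}\right), \]
where
\[ \sigma_{L}^{2} = k f^{2}(t) \left[ \sum_{r=1}^{k} \frac{f_{(r)}^{2}(t)}{F_{(r)}(t)\left(1 - F_{(r)}(t)\right)} \right]^{-1}, \]
$F_{(r)}(t) = B_{r,k+1-r}(F(t))$, and $f_{(r)}(t)$ is the probability density function (pdf) of the $r$th order statistic in a sample of size $k$.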
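Because the log-likelihood is concave in $F(t)$, the maximizer at a fixed $t$ can be found by a one-dimensional search. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the beta CDF is evaluated through its binomial-sum form (valid for integer parameters), and a plain ternary search stands in for whatever optimizer one prefers.

```python
from math import comb, log

def beta_cdf_int(p, r, k):
    """CDF of Beta(r, k+1-r) at p, written as P(Binomial(k, p) >= r)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(r, k + 1))

def log_lik(p, counts, n, k):
    """Binomial log-likelihood (up to a constant); counts[r-1] is Y_r."""
    eps = 1e-12                                  # guard against log(0)
    total = 0.0
    for r, y in enumerate(counts, start=1):
        b = min(max(beta_cdf_int(p, r, k), eps), 1 - eps)
        total += y * log(b) + (n - y) * log(1 - b)
    return total

def mle_cdf(counts, n, k, iters=200):
    """Maximize the concave log-likelihood over [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_lik(m1, counts, n, k) < log_lik(m2, counts, n, k):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical counts at some t for k = 2, n = 10: Y_1 = 7 and Y_2 = 2.
est = mle_cdf([7, 2], n=10, k=2)
print(est, (7 + 2) / 20)                         # MLE next to the EDF value
```

For $k = 1$ the design reduces to SRS and the sketch returns the usual proportion $Y_1/n$, which gives a quick sanity check.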
It can be shown that $\sigma_{L}^{2} \le \sigma_{em}^{2}$, and therefore $\hat{F}_{L}$ is asymptotically more efficient than $\hat{F}_{em}$ under the perfect ranking assumption.
Several variations of the RSS design have been developed to facilitate efficient estimation of the population parameters. For example, Samawi, Abu-Daayeh & Ahmed (1996) proposed extreme ranked set sampling to decrease the ranking error. They showed that the sample mean in extreme ranked set sampling is unbiased, and outperforms its counterpart in SRS of the same size. Muttlak (1996) proposed pair ranked set sampling to reduce the number of observations required for ranking in the RSS scheme by half. Median ranked set sampling was proposed by Muttlak (1997), and it was shown that the corresponding mean estimator is more efficient than $\bar{X}_{RSS}$ for symmetric distributions. Haq, Brown, Moltchanova & Al-Omari (2014) proposed the mixed ranked set sampling design, which mixes the SRS and RSS designs.
In Section 2, we propose some nonparametric estimators for entropy in the RSS scheme. We then compare different entropy estimators via Monte Carlo simulation. In Section 3, we employ the proposed entropy estimators in developing entropy based tests of fit for the inverse Gaussian distribution. We then compare the powers of the proposed tests with their rivals in the literature. A real data example is presented in Section 4. We end with a conclusion in Section 5.

Estimation of Entropy in SRS and RSS Schemes
The entropy of a continuous random variable $X$ with pdf $f$ is defined by Shannon (1948) as
\[ H(f) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx. \tag{3} \]
Since the notion of entropy has wide applications in statistics, engineering and information sciences, the problem of estimating $H(f)$ has been addressed by many researchers. Vasicek (1976) was the first to propose estimating $H(f)$ based on spacings. He noted that equation (3) can be rewritten as
\[ H(f) = \int_{0}^{1} \log\left\{ \frac{d}{dp} F^{-1}(p) \right\} dp. \tag{4} \]
Vasicek (1976) suggested estimating equation (4) by using the EDF and applying a difference operator instead of the differential operator.
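The step from the Shannon definition to the quantile form used by Vasicek can be spelled out with the substitution $p = F(x)$. The short derivation below is a sketch that assumes $F$ is strictly increasing and differentiable, so that $F^{-1}$ exists and has derivative $1/f(F^{-1}(p))$.

```latex
\begin{align*}
H(f) &= -\int_{-\infty}^{\infty} f(x)\,\log f(x)\,dx \\
     &= -\int_{0}^{1} \log f\!\left(F^{-1}(p)\right) dp
        && \text{substituting } p = F(x),\ dp = f(x)\,dx \\
     &= \int_{0}^{1} \log\left\{ \frac{d}{dp} F^{-1}(p) \right\} dp
        && \text{since } \frac{d}{dp} F^{-1}(p) = \frac{1}{f\!\left(F^{-1}(p)\right)}.
\end{align*}
```

Replacing the derivative of $F^{-1}$ by a difference quotient of order statistics is what leads to spacing-based estimators.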
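Let $X_1, \ldots, X_N$ be a simple random sample of size $N$ from the population of interest. Vasicek's (1976) entropy estimator is given by
\[ H_V = \frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{N}{2m} \left( X_{(i+m)} - X_{(i-m)} \right) \right\}, \tag{5} \]
where $X_{(1)}, \ldots, X_{(N)}$ are the ordered values of the simple random sample, $m \le N/2$ is an integer called the window size, $X_{(i)} = X_{(1)}$ for $i < 1$, and $X_{(i)} = X_{(N)}$ for $i > N$.
Ebrahimi, Habibullah & Soo (1994) modified Vasicek's (1976) entropy estimator by assigning smaller weights to the observations at the boundaries in equation (5), which are replaced by $X_{(1)}$ and $X_{(N)}$. Their proposed estimator has the form
\[ H_E = \frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{N}{c_i m} \left( X_{(i+m)} - X_{(i-m)} \right) \right\}, \]
where
\[ c_i = \begin{cases} 1 + \dfrac{i-1}{m}, & 1 \le i \le m, \\ 2, & m + 1 \le i \le N - m, \\ 1 + \dfrac{N-i}{m}, & N - m + 1 \le i \le N. \end{cases} \]
Simulation results of Ebrahimi et al. (1994) showed that $H_E$ has smaller bias and mean square error (MSE) than Vasicek's (1976) entropy estimator.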
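The two spacing estimators can be sketched as follows. The function names are hypothetical, the standard forms of Vasicek's and Ebrahimi et al.'s estimators are assumed, and the data are assumed to contain no ties (a zero spacing would make the logarithm undefined).

```python
from math import log

def _ordered_at(xs, i):
    """i-th ordered value (1-based), with X_(i)=X_(1) for i<1 and X_(N) for i>N."""
    return xs[min(max(i, 1), len(xs)) - 1]

def vasicek(x, m):
    """Vasicek-type spacing estimator of entropy with window size m."""
    xs, N = sorted(x), len(x)
    return sum(
        log(N / (2 * m) * (_ordered_at(xs, i + m) - _ordered_at(xs, i - m)))
        for i in range(1, N + 1)
    ) / N

def ebrahimi(x, m):
    """Ebrahimi et al.-type estimator: boundary terms use c_i < 2 instead of 2."""
    xs, N = sorted(x), len(x)

    def c(i):
        if i <= m:
            return 1 + (i - 1) / m
        if i <= N - m:
            return 2.0
        return 1 + (N - i) / m

    return sum(
        log(N / (c(i) * m) * (_ordered_at(xs, i + m) - _ordered_at(xs, i - m)))
        for i in range(1, N + 1)
    ) / N

# Evenly spaced points on (0, 1]; the entropy of U(0, 1) is 0.
x = [i / 10 for i in range(1, 11)]
print(vasicek(x, 1), ebrahimi(x, 1))
```

On this toy grid the boundary correction removes the negative bias that the clamped spacings introduce into the Vasicek-type value.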
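Another modification of Vasicek's (1976) entropy estimator was proposed by Correa (1995). He noted that equation (5) can be rewritten as
\[ H_V = -\frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{(i+m)/N - (i-m)/N}{X_{(i+m)} - X_{(i-m)}} \right\}. \]
The expression inside the braces in the above equation is the slope of the straight line that joins the points $\left(X_{(i-m)}, (i-m)/N\right)$ and $\left(X_{(i+m)}, (i+m)/N\right)$. Correa (1995) suggested estimating this slope by local linear regression, using all $2m + 1$ points instead of only two points. His suggested entropy estimator has the form
\[ H_C = -\frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{\sum_{j=i-m}^{i+m} \left( X_{(j)} - \bar{X}_{(i)} \right)(j - i)}{N \sum_{j=i-m}^{i+m} \left( X_{(j)} - \bar{X}_{(i)} \right)^{2}} \right\}, \]
where $\bar{X}_{(i)} = \frac{1}{2m+1} \sum_{j=i-m}^{i+m} X_{(j)}$. Correa's (1995) simulation results indicate that $H_C$ generally produces a smaller MSE than $H_V$.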
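The local-regression idea can be sketched in Python as below. This assumes the standard form of Correa's estimator with the same boundary convention as Vasicek's estimator ($X_{(j)} = X_{(1)}$ for $j < 1$ and $X_{(j)} = X_{(N)}$ for $j > N$); the function name is hypothetical.

```python
from math import log

def correa(x, m):
    """Correa-type entropy estimator: local least-squares slope over 2m+1 points."""
    xs, N = sorted(x), len(x)

    def at(j):                               # clamp the index to [1, N]
        return xs[min(max(j, 1), N) - 1]

    total = 0.0
    for i in range(1, N + 1):
        window = [(j, at(j)) for j in range(i - m, i + m + 1)]
        xbar = sum(v for _, v in window) / (2 * m + 1)
        num = sum((v - xbar) * (j - i) for j, v in window)
        den = N * sum((v - xbar) ** 2 for _, v in window)
        total += log(num / den)              # log of the local slope of the EDF
    return -total / N

# Evenly spaced points on (0, 1]: interior slopes equal 1, so only the two
# boundary windows contribute to the estimate.
x = [i / 10 for i in range(1, 11)]
print(correa(x, 1))
```

Using all points in the window makes each local slope less sensitive to a single extreme spacing than the two-point difference quotient.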
The problem of entropy estimation in the RSS scheme has been considered by Mahdizadeh & Arghami (2009). Let $\{X_{[r]s} : r = 1, \ldots, k;\ s = 1, \ldots, n\}$ be a ranked set sample of size $N = nk$ with ordered values $Z_1, \ldots, Z_N$. Mahdizadeh & Arghami's (2009) entropy estimator is given by
\[ H_M = \frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{N}{2m} \left( Z_{i+m} - Z_{i-m} \right) \right\}, \]
where $Z_i = Z_1$ for $i < 1$ and $Z_i = Z_N$ for $i > N$. Mahdizadeh & Arghami's (2009) simulation results indicate that $H_M$ is superior to its counterpart in SRS, $H_V$. They then developed an entropy based goodness of fit test for the inverse Gaussian distribution in the RSS scheme. Mahdizadeh (2012) used this entropy estimator to develop a test of fit for the Laplace distribution based on a ranked set sample.
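The first and second estimators we propose in this paper are motivated by Ebrahimi et al.'s (1994) entropy estimator in SRS. Their estimator can be rewritten as
\[ H_E = \frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{X_{(i+m)} - X_{(i-m)}}{\hat{F}_N\left(X_{(i+m)}\right) - \hat{F}_N\left(X_{(i-m)}\right)} \right\}, \]
where $\hat{F}_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(X_i \le t)$ is the EDF in SRS. Thus, analogous entropy estimators in the RSS scheme can be developed as
\[ H_E^{w} = \frac{1}{N} \sum_{i=1}^{N} \log\left\{ \frac{Z_{i+m} - Z_{i-m}}{\hat{F}_w\left(Z_{i+m}\right) - \hat{F}_w\left(Z_{i-m}\right)} \right\}, \]
where $w \in \{em, L\}$, with $\hat{F}_{em}$ and $\hat{F}_{L}$ defined in (1) and (2), respectively.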
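A CDF-based estimator of this kind is easy to compare numerically with the plain spacing estimator computed from the same ordered values. The sketch below uses hypothetical function names and made-up data; it plugs in the EDF and checks that, when there are no ties and $N \ge 2m$, the two estimates differ only by a constant that depends on $N$ and $m$.

```python
from math import log, factorial

def clamp_index(zs, i):
    """Return the i-th ordered value (1-based), clamping i to [1, N]."""
    return zs[min(max(i, 1), len(zs)) - 1]

def spacing_entropy(z, m):
    """Spacing estimator with the fixed divisor 2m/N in every term."""
    zs, N = sorted(z), len(z)
    return sum(
        log(N / (2 * m) * (clamp_index(zs, i + m) - clamp_index(zs, i - m)))
        for i in range(1, N + 1)
    ) / N

def cdf_based_entropy(z, m, cdf):
    """Spacing estimator whose divisor is a difference of CDF estimates."""
    zs, N = sorted(z), len(z)
    total = 0.0
    for i in range(1, N + 1):
        hi, lo = clamp_index(zs, i + m), clamp_index(zs, i - m)
        total += log((hi - lo) / (cdf(hi) - cdf(lo)))
    return total / N

# Hypothetical pooled measurements (N = 10, no ties) and window size m = 3.
z = [0.31, 1.20, 0.05, 2.40, 0.77, 0.52, 1.85, 0.94, 0.18, 3.10]
m = 3
N = len(z)
edf = lambda t: sum(v <= t for v in z) / N
h_m = spacing_entropy(z, m)
h_em = cdf_based_entropy(z, m, edf)
shift = (2 / N) * (m * log(2 * m) + log(factorial(m - 1) / factorial(2 * m - 1)))
print(h_em - h_m, shift)
```

Passing a different CDF estimate as `cdf` (for example a maximum likelihood estimate) changes the divisor term by term, so the resulting estimator is no longer a constant shift of the spacing estimator.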
We can also modify Correa's (1995) entropy estimator for use in the RSS scheme. The resulting Correa-type RSS estimators of entropy are denoted by $H_C^{w}$, $w \in \{em, L\}$.
In the remainder of this section, we compare different entropy estimators by using Monte Carlo simulation. We generated 50,000 random samples of sizes 10, 20, 30 and 50 in RSS. The set size is taken to be 2 and 5. So, we can assess the effect of increasing the total sample size ($N$) for a fixed set size, and also the effect of increasing the set size ($k$) for a fixed sample size, on the performance of the estimators in the RSS setting. The ranking process is done by using the fraction of random ranking model due to Frey, Ozturk & Deshpande (2007). In this imperfect ranking scenario, it is assumed that the $r$th judgement order statistic is the true $r$th order statistic with probability $\lambda$, and it is selected randomly with probability $1 - \lambda$. Therefore, the distribution of the $r$th judgement order statistic is given by
\[ F_{[r]}(t) = \lambda F_{(r)}(t) + (1 - \lambda) F(t), \]
where $F_{(r)}$ is the CDF of the true $r$th order statistic in a sample of size $k$. In this simulation study, the values of $\lambda$ are taken to be $\lambda = 1$ (perfect ranking), $\lambda = 0.8$ (nearly perfect ranking), $\lambda = 0.5$ (moderate ranking) and $\lambda = 0.2$ (almost random ranking). The selection of the window size ($m$) that minimizes the MSE of the entropy estimator is still an open problem in the entropy estimation context. We have used Grzegorzewski & Wieczorkowski's (1999) heuristic formula to select $m$ subject to $N$ in the entropy estimators as follows:
\[ m = \left[ \sqrt{N} + 0.5 \right], \]
where $[x]$ denotes the integer part of $x$. In order to compare different entropy estimators, we have reported the root mean squared error (RMSE) of the entropy estimators for the standard uniform ($U(0, 1)$), standard exponential ($Exp(1)$) and standard normal ($N(0, 1)$) distributions in Tables 1-3, respectively.
We develop some entropy based goodness of fit tests for the inverse Gaussian distribution using RSS. The pdf of a continuous random variable $X$ with inverse Gaussian distribution is given by
\[ f_{\mu,\lambda}(x) = \sqrt{\frac{\lambda}{2\pi x^{3}}} \exp\left\{ -\frac{\lambda (x - \mu)^{2}}{2 \mu^{2} x} \right\}, \quad x > 0, \tag{11} \]
where $\mu > 0$ and $\lambda > 0$. The CDF of a random variable $X$ with inverse Gaussian distribution is given by
\[ F_{\mu,\lambda}(x) = \Phi\left( \sqrt{\frac{\lambda}{x}} \left( \frac{x}{\mu} - 1 \right) \right) + \exp\left( \frac{2\lambda}{\mu} \right) \Phi\left( -\sqrt{\frac{\lambda}{x}} \left( \frac{x}{\mu} + 1 \right) \right), \]
where $\Phi(\cdot)$ is the CDF of the standard normal distribution. We refer the interested reader to Sanhueza, Leiva & López-Kleine (2011) for more information about the properties of this distribution. Vasicek (1976) was the first to develop an entropy based goodness of fit test for the normal distribution, using a characterization of the normal distribution based on entropy. Since then, many researchers have developed entropy based tests of fit for many well known distributions by characterizing them in terms of entropy. Mudholkar & Tian (2002) presented the following characterization of the inverse Gaussian distribution, and used it to develop a test of fit.
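For readers who want to experiment with the inverse Gaussian null model, its pdf and CDF can be coded directly from the standard closed forms. The sketch below uses hypothetical function names and only the Python standard library; the final lines check numerically that integrating the pdf over $(0, 1]$ reproduces the CDF at 1 for $\mu = \lambda = 1$.

```python
from math import sqrt, exp, pi, erf

def norm_cdf(z):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def ig_pdf(x, mu, lam):
    """Inverse Gaussian density with mean mu and shape parameter lam, x > 0."""
    return sqrt(lam / (2.0 * pi * x**3)) * exp(-lam * (x - mu) ** 2 / (2.0 * mu**2 * x))

def ig_cdf(x, mu, lam):
    """Inverse Gaussian CDF, x > 0."""
    a = sqrt(lam / x)
    return norm_cdf(a * (x / mu - 1.0)) + exp(2.0 * lam / mu) * norm_cdf(-a * (x / mu + 1.0))

# Midpoint-rule integral of the pdf over (0, 1] compared with the closed-form CDF.
steps = 20000
h = 1.0 / steps
area = sum(ig_pdf((j + 0.5) * h, 1.0, 1.0) for j in range(steps)) * h
print(area, ig_cdf(1.0, 1.0, 1.0))
```

Note that $\lambda$ here is the shape parameter of the inverse Gaussian distribution, not the fraction of perfect rankings used in the imperfect ranking model.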
Theorem 1 (Mudholkar & Tian 2002). The random variable $X$ with inverse Gaussian distribution is characterized by the property that $1/\sqrt{X}$ attains the maximum entropy among all non-negative, continuous random variables $Y$ with a given value of $E\left(Y^{2}\right) - 1/E\left(Y^{-2}\right)$.
Let $x_{(1)}, \ldots, x_{(N)}$ be the observed ordered values of a simple random sample of size $N$ from a continuous population with pdf $f(x)$. Let $y_i = 1/\sqrt{x_{(N+1-i)}}$, for $i = 1, \ldots, N$. Mudholkar & Tian (2002) suggested rejecting the composite null hypothesis $H_0: f(x) = f_{\mu,\lambda}(x)$ for some $\mu > 0$ and $\lambda > 0$ if
\[ T_V = \frac{\exp\left\{ H_V(y) \right\}}{w} \le T_{V,\alpha}, \tag{12} \]
where $f_{\mu,\lambda}(x)$ is the pdf in (11), $H_V(y)$ is Vasicek's (1976) entropy estimator based on the $y_i$ values, $w^{2} = \sum_{i=1}^{N} \left( 1/x_{(i)} - 1/\bar{x} \right)$, and $T_{V,\alpha}$ is the $100\alpha$ percentile of the null distribution of $T_V$.
One can also substitute Vasicek's (1976) entropy estimator in (12) with Correa's (1995) entropy estimator, and construct a test of fit for the inverse Gaussian distribution.
Let $\{x_{[r]s} : r = 1, \ldots, k;\ s = 1, \ldots, n\}$ be an observed ranked set sample of size $N = nk$ from a continuous population with pdf $f(x)$. Let $z_1, \ldots, z_N$ be the ordered values of the ranked set sample, and $y_i^{*} = 1/\sqrt{z_{N+1-i}}$ for $i = 1, \ldots, N$. Following the lines of Mudholkar & Tian (2002), we propose to reject the composite null hypothesis for small values of a test statistic $T_A^{B}$ built from the entropy estimator $H_A^{B}(y^{*})$ in the same way as in (12), that is, if $T_A^{B} \le T_{A,\alpha}^{B}$, where $A \in \{V, E, C\}$ and $B \in \{em, L\}$. Here $H_A^{B}(y^{*})$ is the entropy estimator of type $A$ computed from the $y_i^{*}$ values, and the critical value $T_{A,\alpha}^{B}$ is obtained under the assumption of perfect ranking. It is worth noting that the critical values of all the above entropy based tests cannot be obtained analytically because of the complicated form of the corresponding test statistics. Thus, the critical values of the entropy based tests of fit should be obtained via Monte Carlo simulation.
Remark 1. In line with Ebrahimi et al. (1994), one can simply show that
\[ H_E^{em} = H_M + \frac{2}{N} \left\{ m \log(2m) + \log \frac{(m-1)!}{(2m-1)!} \right\}, \]
and therefore the goodness of fit tests based on $H_M$ and $H_E^{em}$ are equivalent.
In the sequel, we compare different entropy based tests of size 0.05 for the inverse Gaussian distribution in RSS. For $N = 10$, 20 and 50, we generated 50,000 random samples in the RSS scheme, so we can observe the performance of the tests when the sample size is small ($N = 10$), moderate ($N = 20$), and large ($N = 50$). The set size ($k$) in the RSS setting is taken to be 2 and 5, so we can assess the effect of increasing the set size on the goodness of fit tests. The imperfect ranking scenario is the fraction of random ranking model described in the previous section, and the value of $\lambda$ (the fraction of perfect ranking) is taken to be 1, 0.8, 0.5 and 0.2. The alternative distributions used in this simulation study are the standard exponential distribution ($Exp(1)$), the Weibull distribution with shape parameter 2 and scale parameter 1 ($W(2, 1)$), the lognormal distribution with mean $e^{2}$ and standard deviation $e^{2}\sqrt{e^{4} - 1}$ ($LN(0, 2)$), the beta distribution with parameters 2 and 2 ($Beta(2, 2)$) and the beta distribution with parameters 5 and 2 ($Beta(5, 2)$). We also considered the standard inverse Gaussian distribution ($IG(1, 1)$) to assess the different tests in terms of type I error rate control. This is important because the critical values of the entropy based tests are obtained under the assumption of perfect ranking. Figure 1 shows the pdfs of the alternative distributions. It is clear from this figure that a variety of functional forms of pdfs are considered in the simulation study.
The value of the window size ($m$) plays a significant role in entropy based goodness of fit tests. Given a sample size, the optimum value of the window size, which produces the maximum power of each test, depends on the alternative distribution. Since the alternative distribution is unknown in practice, it is not possible to determine a single optimum value for $m$. In Table 4, we present the suggested value of $m$ subject to $N$ which gives relatively good powers for all alternatives considered in this simulation study. In the simulation study, the value of $m$ is selected according to Table 4.
The simulation results are presented in Tables 5-7. Table 5 presents the simulation results for $N = 10$. We observe from this table that the powers of all goodness of fit tests increase with the set size ($k$) while the other parameters are fixed. It is also evident that the powers of all tests decrease when the value of $\lambda$ goes from one to zero (from the perfect ranking case to the imperfect ranking case). While in the perfect ranking setup ($\lambda = 1$) the tests based on $H_M$ and $H_C^{L}$ are the most powerful ones, $T_M$ beats the others in the imperfect ranking setup. It is of interest to note that $T_C^{L}$ is the least powerful test in the case of imperfect ranking ($\lambda < 1$).
The estimated powers of the goodness of fit tests for sample sizes $N = 20$ and 50 are reported in Tables 6-7. As one expects, the powers of all tests increase with the sample size ($N$). The test based on $H_M$ is the most powerful test in most of the considered cases, and the test based on $H_C^{L}$ is the least powerful test in the case of imperfect ranking.
The data set used in this section was obtained by Murray, Ridout & Cross (2000) and is known as the apple tree data set. This data set is the result of a research study in which apple trees were sprayed with a chemical containing a fluorescent tracer, Tinopal CBS-X, at 2% concentration level in water, and is given in Table 5 of Mahdizadeh & Zamanzade (2016). The variable of interest is the percentage of each leaf's upper surface area which is covered with spray deposit. It is important to note that the exact measurement of the variable of interest requires chemical analysis of the solution collected from the surface of the leaves, which is expensive and time-consuming. On the other hand, an expert can use the visual appearance of the spray deposits on the leaf surfaces under ultraviolet light for ranking them within each set. Therefore, RSS can be regarded as offering the potential for improving statistical inference over SRS. Murray et al. (2000) collected data by using RSS with set size 5 and cycle size 10 in two different groups (low and high volumes of spray). Suppose that we are interested in fitting a statistical model to the two groups of the apple tree data set. The entropy-based goodness of fit test statistic value (TSV) along with its critical value (CV) at significance level $\alpha = 0.05$ for the inverse Gaussian distribution are given in Table 8. By comparing each test statistic with the corresponding critical value, we conclude that the two data sets do not follow the inverse Gaussian distribution.

Conclusion
In this paper, we employed empirical and maximum likelihood estimators of the CDF to develop some entropy based tests for the inverse Gaussian distribution in the RSS scheme. We observe that although the entropy estimators based on maximum likelihood estimation of the CDF perform well in terms of RMSE, the corresponding tests are not successful when the ranking is not perfect. Since the quality of ranking in RSS is often unknown in practice, we recommend using the test of fit for the inverse Gaussian distribution based on the empirical distribution function.

Figure 1: The pdf of different alternative distributions.

Table 1: Power estimates of different entropy based tests for inverse Gaussian distribution for

Table 3: Monte Carlo estimates of RMSE of different entropy estimators for
Table 1 gives the simulation results when the parent distribution is standard uniform. We observe from this table that $H_E^{em}$ has a smaller RMSE than $H_M$ for all considered values of $N$, $k$ and $\lambda$. We also observe that the performance of all estimators improves as the sample size ($N$) or set size ($k$) increases while the other parameters are fixed. It is also interesting to note that $H_C^{em}$ and $H_C^{L}$ are the best entropy estimators in terms of RMSE, and the differences in their performances are negligible.

Table 4: Suggested values of m subject to N in entropy based tests for inverse Gaussian distribution.

Table 5: Power estimates of different entropy based tests for inverse Gaussian distribution for N = 10 in RSS.

Table 6: Power estimates of different entropy based tests for inverse Gaussian distribution for N = 20 in RSS.

Table 7: Power estimates of different entropy based tests for inverse Gaussian distribution for N = 50 in RSS.
Finally, we would like to mention that all simulation studies in this work are programmed using R statistical software, and the corresponding code is available on request from the first author.

Table 8: Entropy-based goodness of fit test of inverse Gaussian distribution for apple tree data set.