Estimation of the population distribution function using varied L ranked set sampling

A generalized ranked set sampling (RSS) plan called varied L RSS (VLRSS) has recently been proposed in the literature. VLRSS has been shown to encompass several existing RSS variations and to estimate the population mean efficiently. In this article, we extend that work to the estimation of the cumulative distribution function (CDF) under VLRSS. Three new CDF estimators are proposed and their asymptotic properties are investigated theoretically. Taking into account the information carried by the unmeasured sampling units, we also propose a general class of CDF estimators. Using small Monte Carlo experiments, we compare the proposed CDF estimators with the conventional CDF estimator under RSS. The conventional RSS-based CDF estimator is outperformed by at least one of the VLRSS-based CDF estimators in most of the considered cases. Finally, an empirical example illustrates the potential application of the proposed estimators.


Introduction
Ranked set sampling (RSS) is one of the most widely used sampling methods. It is very helpful in situations where ranking can be done at a cost that is low relative to the cost of accurately measuring the variable of interest. For instance, suppose the average height of the trees in a forest is to be estimated. A small set of selected trees can be ranked visually with respect to height rather than by actually measuring each tree. Although RSS was first applied in agriculture by McIntyre (1952), it has subsequently been adopted in many other fields, including environmental monitoring (Kvam 2003).
Due to the power and applicability of RSS, many studies have examined the performance of almost every standard statistical problem under RSS and its variations.
We now briefly review some important recent studies. Mahdizadeh and Zamanzade (2016) developed a kernel-based estimator of a dynamic reliability statistic based on RSS. Zamanzade and Mahdizadeh (2017, 2018) addressed the population proportion under RSS and its variations. Frey 2021) considered the stress-strength reliability for the generalized inverted exponential distribution using RSS and its variations.
To obtain an RSS, randomly draw $m$ sets, each of $m$ units. Then rank the units within each of the $m$ sets separately in ascending order without numerically measuring them. The ranking can be done using an auxiliary variable or subjectively. Finally, measure exactly the $i$th smallest unit from the $i$th set, where $i = 1, 2, \ldots, m$. These steps constitute one cycle. If needed, repeat the cycle $r$ times to obtain an RSS of size $n = mr$. The RSS steps for $r = 1$ are illustrated in Fig. 1, where $X_{ij}$ is the $j$th unit from the $i$th sample.
It is important to emphasize that the ranking process may be inaccurate and contain errors; this situation is called imperfect ranking. Conversely, perfect ranking is the situation in which the sampling units are ranked without error. Symbolically, an RSS is denoted by $\{X_{(i:m)j} : i = 1, \ldots, m;\ j = 1, \ldots, r\}$. Here, $X_{(i:m)j}$ is the $i$th order statistic from the $i$th sample of size $m$ in the $j$th cycle. The round parentheses in $X_{(i:m)j}$ indicate that perfect ranking is assumed. In that case, for a fixed $i$, $X_{(i:m)j}$ follows the distribution of the $i$th order statistic from a sample of size $m$. If imperfect ranking is assumed instead, the round parentheses are replaced with square brackets. On the other hand, if the number of the selected sampling items is fixed
Step 1: Identify the value of the VLRSS coefficient $k = [\alpha m]$ such that $0 \le \alpha < 0.5$, where $[\alpha m]$ is the largest integer less than or equal to $\alpha m$ and $m$ is the set size.

Fig. 2. Demonstration of VLRSS procedure for one cycle
Step 2: Randomly select $k$ samples, each of size $m_1$, from the population of interest, where $m_1 \ge m$ or $m_1 < m$ is a value determined by cost or budget constraints.
Step 3: Select the $v$th smallest ranked unit from each of the $k$ sets obtained in Step 2, where
Step 4: Repeat Step 2 to obtain another $k$ samples, each of size $m_1$.
Step 5: This time, select the $(m_1 - v + 1)$th smallest ranked unit from each of the $k$ sets obtained in Step 4.
Step 6: Select a further $m(m - 2k)$ units from the population of interest and allocate them into $(m - 2k)$ sets, each of $m$ units.
Step 8: This completes one cycle of VLRSS of size $m$. Steps 1-7 can be repeated $r$ times to obtain a VLRSS of size $n = mr$.
The VLRSS mechanism with one cycle is illustrated in Fig. 2.
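The sampling steps above can be sketched in Python under perfect ranking. This is an illustrative sketch, not the author's implementation. The function names, the `draw` callback and the symbols `m`, `k`, `m1`, `v`, `r` are assumptions introduced here. Step 7 is not recoverable from the text, so the sketch assumes that the $i$th smallest unit is measured from the $i$th of the remaining sets for $i = k+1, \ldots, m-k$, as in L ranked set sampling.

```python
import random


def rss_sample(draw, m, r=1, rng=random):
    """Balanced RSS under perfect ranking: r cycles of m sets, each of m units."""
    sample = []
    for _ in range(r):
        for i in range(1, m + 1):
            ranked = sorted(draw(rng) for _ in range(m))
            sample.append(ranked[i - 1])  # measure the i-th smallest of set i
    return sample


def vlrss_sample(draw, m, k, m1, v, r=1, rng=random):
    """One reading of VLRSS under perfect ranking (Step 7 is an assumption)."""
    if not (0 <= 2 * k < m):
        raise ValueError("k must satisfy 0 <= k < m / 2")
    if not (1 <= v <= m1):
        raise ValueError("v must satisfy 1 <= v <= m1")
    sample = []
    for _ in range(r):
        # Steps 2-3: k sets of size m1, keep the v-th smallest of each.
        for _ in range(k):
            ranked = sorted(draw(rng) for _ in range(m1))
            sample.append(ranked[v - 1])
        # Steps 4-5: another k sets of size m1, keep the (m1 - v + 1)-th smallest.
        for _ in range(k):
            ranked = sorted(draw(rng) for _ in range(m1))
            sample.append(ranked[m1 - v])
        # Step 6 and the assumed Step 7: m - 2k sets of size m,
        # keep the i-th smallest for i = k + 1, ..., m - k.
        for i in range(k + 1, m - k + 1):
            ranked = sorted(draw(rng) for _ in range(m))
            sample.append(ranked[i - 1])
    return sample


if __name__ == "__main__":
    gen = random.Random(2024)
    standard_normal = lambda g: g.gauss(0.0, 1.0)
    print(vlrss_sample(standard_normal, m=5, k=1, m1=10, v=3, rng=gen))
```

With `k = 0` the sketch reduces to balanced RSS, and each cycle always returns `m` measured units.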
Under the assumption of the consistency of the ranking process, which implies that:
On the other hand, the variance of  ̂(),  ( ̂()), can be written as: where  , () is the CDF of the Beta distribution with parameters  and  at the point . Thus, the bias of  ̂() tends to zero as  → ∞, while the variance of  ̂() tends to zero as either  or  → ∞. This establishes the consistency of  ̂(), which completes the proof.
(b) Let: Since the   's are iid with finite mean and variance, the desired result follows from the Central Limit Theorem.
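The role of the Beta CDF in the mean and variance can be checked numerically. Under perfect ranking, the indicator that the $a$th order statistic of a set of size $s$ is at most $t$ is Bernoulli with success probability equal to the Beta($a$, $s-a+1$) CDF evaluated at $F(t)$. For integer parameters this equals a binomial tail sum. The sketch below computes the exact mean and variance of the plain empirical CDF for one cycle of an arbitrary design. The function names are hypothetical, and `vlrss_design` assumes that the middle $m-2k$ sets contribute ranks $k+1, \ldots, m-k$, which is an assumption about the step missing from the text.

```python
from math import comb


def order_stat_cdf(a, size, p):
    """P(X_(a:size) <= t) when F(t) = p, i.e. the Beta(a, size - a + 1) CDF at p."""
    return sum(comb(size, j) * p**j * (1 - p) ** (size - j)
               for j in range(a, size + 1))


def edf_mean_var(design, p):
    """Exact mean and variance of the empirical CDF at a point with F(t) = p.

    design: list of (rank, set_size) pairs, one per measured unit;
    units are assumed independent (perfect ranking, separate sets).
    """
    probs = [order_stat_cdf(a, s, p) for a, s in design]
    n = len(probs)
    mean = sum(probs) / n
    var = sum(q * (1 - q) for q in probs) / n**2
    return mean, var


def vlrss_design(m, k, m1, v):
    """(rank, set_size) pairs for one VLRSS cycle under the assumed Step 7."""
    return ([(v, m1)] * k + [(m1 - v + 1, m1)] * k
            + [(i, m) for i in range(k + 1, m - k + 1)])


if __name__ == "__main__":
    p = 0.2
    rss = [(i, 5) for i in range(1, 6)]
    print("RSS   mean, var:", edf_mean_var(rss, p))
    print("VLRSS mean, var:", edf_mean_var(vlrss_design(5, 1, 10, 3), p))
```

For balanced RSS the mean equals $F(t)$ exactly, whereas for a VLRSS design the plain empirical CDF is generally biased away from the center of the distribution.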

CDF Estimation Using Likelihood Function
In this part, we consider maximum likelihood estimation (MLE) of the CDF to improve on the precision of  ̂(), as the latter does not efficiently incorporate the information provided by the ranking process. Let To simplify the notation,    () is written as   . Assuming perfect ranking, the elements of   have a Bernoulli distribution, so the corresponding likelihood function can be written as: Then, the log-likelihood function is: Therefore, our second proposed CDF estimator,  ̂(), based on VLRSS is obtained by maximizing () or, equivalently, ℒ(), as shown below: where  , () is the pdf of the Beta distribution with parameters  and  at the point .
One can easily verify that VLRSS is given by: where: and Hence, the variance of  ̂() can be expressed as: .
The following proposition shows the asymptotic properties of  ̂().
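The likelihood idea can be illustrated with a small numerical sketch. Each indicator $I(X \le t)$ is treated as a Bernoulli variable whose success probability is the Beta CDF of the unit's rank and set size evaluated at $p = F(t)$, and the log-likelihood is maximized over $p$. The names `log_lik` and `mle_cdf`, the grid-plus-ternary search and the clamping constants are assumptions made for illustration. The closed-form expressions of the paper are not reproduced here.

```python
from math import comb, log


def order_stat_cdf(a, size, p):
    """Beta(a, size - a + 1) CDF at p, written as a binomial tail sum."""
    return sum(comb(size, j) * p**j * (1 - p) ** (size - j)
               for j in range(a, size + 1))


def log_lik(p, z, design):
    """Bernoulli log-likelihood of the indicators z given F(t) = p."""
    total = 0.0
    for zi, (a, s) in zip(z, design):
        # Clamp to avoid log(0) when the Beta CDF rounds to 0 or 1.
        q = min(max(order_stat_cdf(a, s, p), 1e-12), 1 - 1e-12)
        total += log(q) if zi else log(1 - q)
    return total


def mle_cdf(values, design, t, grid=400):
    """Maximize the log-likelihood over p in (0, 1).

    values: measured units; design: matching (rank, set_size) pairs.
    A coarse grid locates the maximum, then a ternary search refines it
    (this assumes the log-likelihood is unimodal near the grid maximum).
    """
    z = [x <= t for x in values]
    candidates = [(j + 0.5) / grid for j in range(grid)]
    best = max(candidates, key=lambda p: log_lik(p, z, design))
    lo, hi = max(best - 1 / grid, 1e-9), min(best + 1 / grid, 1 - 1e-9)
    for _ in range(60):
        left = lo + (hi - lo) / 3
        right = hi - (hi - lo) / 3
        if log_lik(left, z, design) < log_lik(right, z, design):
            lo = left
        else:
            hi = right
    return (lo + hi) / 2
```

When every unit has rank 1 in a set of size 1 (simple random sampling), the maximizer reduces to the sample proportion of values not exceeding `t`, which is a quick sanity check of the sketch.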

Proposition
= () + () + (), where the three terms between brackets will be denoted respectively by (), () and (). Using a Taylor series expansion, the logarithm function of the first term () can be approximated as:
Under the assumption of the consistency of the ranking process, we obtain: (6) On the other hand, the asymptotic variance: where the three terms between the brackets shown in (7) will be denoted respectively by  1 ,  2 and  3 .
Since: After simple algebra, we get: Under the assumption h → 0 as  → ∞, we obtain: (8) Likewise, one can easily show that: (9) By substituting (8) and (9) in (7), we get: (10) In light of (6) and (10), one can conclude that, for fixed ,  1 , , the bias of  ̂() tends to zero as  → ∞ and the variance of  ̂() tends to zero as either  or  → ∞, so that MSE( ̂()) → 0, which completes the proof, where MSE refers to the mean squared error.
(b) Let:  = 1, 2, …,  Following the same procedure as in the proof of Proposition 1 (b), the desired result is obtained.
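A kernel-smoothed CDF estimate with bandwidth $h$ can be sketched as follows. This is only the generic equal-weight form with a Gaussian kernel, which is an assumption. The exact kernel, weights and bandwidth rule of the VLRSS-based estimator are not recoverable from the text, and the function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal CDF, used here as the integrated kernel."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def kernel_cdf(sample, t, h):
    """Equal-weight kernel CDF estimate at t with bandwidth h > 0."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    return sum(normal_cdf((t - x) / h) for x in sample) / len(sample)


if __name__ == "__main__":
    data = [-1.3, -0.4, 0.1, 0.8, 1.9]
    for h in (1.0, 0.3, 1e-6):
        print(h, kernel_cdf(data, 0.5, h))
```

As $h \to 0$ the smoothed estimate approaches the empirical CDF, which is consistent with the role of the condition $h \to 0$ in the proof above.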

CDF Estimation Using Unmeasured units
In this part, we incorporate all the potential information provided by the unmeasured items to construct a general class of CDF estimators. Motivated by the aforementioned CDF estimators, our proposed class of CDF estimators is based on

Proof:
Let  ̂ * () be written as: where the three terms between brackets in (11) will be denoted respectively by ,  and .

Simulation Study
To assess the performance of the proposed procedures, a simulation study is conducted for various values of (), , ,  1 and  when the underlying distribution of the data is the standard normal (a symmetric distribution) or the standard lognormal (an asymmetric distribution). To better analyze the simulated results, the effect of ranking quality is also taken into account by adopting the imperfect ranking model of Dell and Clutter (1972) with correlation coefficient . Three configurations of  are considered:  = 1 for perfect ranking,  = 0.9 for imperfect ranking with reasonably good accuracy, and  = 0.5 for imperfect ranking. The number of simulation runs is 10,000. The estimators are compared in terms of relative efficiency (RE), defined as: larger than one implies that  ̂() asymptotically outperforms  ̂() at the point , and vice versa. Due to space considerations, we set  = 1, as we observed that the RE is only slightly affected by the number of cycles. Based on Tables 1-6, the following remarks can be highlighted.
-Firstly, one can clearly observe that  ̂() and  ̂1 () are somewhat better than the other CDF estimators and have REs greater than 1 in most considered cases.
-It seems that the information provided by the unmeasured units is useful when  lies in the lower tail of the population distribution, even when the ranking quality is weak.
-Changing the underlying distribution does not substantially affect the patterns of the studied estimators, apart from a few cases. In several cases, the REs are higher when the parent distribution is lognormal rather than normal.
-All the REs are strongly affected by the location of the value . Generally speaking,  ̂() outperforms its RSS counterpart when  is close to the center of the population distribution and  = 1. Yet if  > 1,  ̂() performs well when  approaches at least one of the boundaries. On the other hand,  ̂() becomes more efficient than  ̂() if  is near at least one of the boundaries, with some exceptions. Interestingly, these results are generally valid regardless of the ranking quality.
-The ranking quality has a positive effect on the performance of the studied estimators, particularly for large  and  1 . As expected, this effect is stronger for the estimators based on the missing data approach and weaker for the others.
-With all other factors held fixed, increasing either  1 and  or  1 and  can improve the behavior of almost all the studied estimators in some cases.
-It is also observed that increasing  with the other factors held fixed may not improve the REs, even if the ranking is perfect. Further, in several cases, the REs of almost all the studied estimators are higher when  1 ≥  than when  1 < .
Overall, the conventional RSS-based CDF estimator is outperformed by at least one of the VLRSS-based CDF estimators in almost all the considered cases, particularly when  lies near at least one of the boundaries of the parent distribution (either symmetric or asymmetric). In addition, the superiority of  ̂() and  ̂1 () over  ̂() is clearly observed in most cases, and these two estimators mostly alternate between first and second place among all the considered estimators.
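A Monte Carlo comparison of this kind can be sketched as below. The sketch estimates the ratio of mean squared errors of the plain empirical CDF under two designs for a standard normal population. It does not reproduce the proposed estimators or the tabulated values. Imperfect ranking is implemented by ranking on a noisy concomitant whose correlation with the true value is `rho`, which is a common way to realize the Dell and Clutter (1972) model and is an assumption here. All function names are hypothetical.

```python
import random
from math import sqrt


def ranked_unit(rank, size, rho, rng):
    """Draw `size` N(0, 1) units, rank them on a noisy concomitant,
    and return the true value of the unit judged to have rank `rank`."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(size)]
    noise = sqrt(1.0 - rho * rho)
    judged = sorted(xs, key=lambda x: rho * x + noise * rng.gauss(0.0, 1.0))
    return judged[rank - 1]


def edf(sample, t):
    """Plain empirical CDF at t."""
    return sum(x <= t for x in sample) / len(sample)


def simulate_re(baseline, competitor, t, true_p, rho=1.0, reps=2000, seed=0):
    """Estimate MSE(baseline) / MSE(competitor) for the empirical CDF at t.

    Each design is a list of (rank, set_size) pairs, one per measured unit.
    Values above one favor the competitor design.
    """
    rng = random.Random(seed)
    mses = []
    for design in (baseline, competitor):
        sq_err = 0.0
        for _ in range(reps):
            sample = [ranked_unit(a, s, rho, rng) for a, s in design]
            sq_err += (edf(sample, t) - true_p) ** 2
        mses.append(sq_err / reps)
    return mses[0] / mses[1]


if __name__ == "__main__":
    srs = [(1, 1)] * 5                     # five independent single units
    rss = [(i, 5) for i in range(1, 6)]    # balanced RSS with set size 5
    for rho in (1.0, 0.9, 0.5):
        print(rho, simulate_re(srs, rss, t=0.0, true_p=0.5, rho=rho))
```

Passing an RSS design as `baseline` and a VLRSS design as `competitor` gives the orientation used in the remarks above, where values greater than one favor the VLRSS-based design.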

An Empirical Study
In what follows, we illustrate the applicability of the proposed CDF estimators using a real data set known as the tree dataset, found in Chen et al. (2004). This data set contains 396 observations on seven variables. Here, we restrict our attention to two variables: "the entire height in feet" denoted by the response variable and the "diameter We will consider the tree dataset as the hypothetical population. For the same values of (), , ,  1 and  shown in Tables 1-6, 10,000 samples are drawn with replacement using the RSS and VLRSS schemes. For each selected sample,  ̂() and  ̂() are computed; the ARE of each considered CDF estimator is then estimated and listed in Table 7.
The results presented in Table 7 are clearly consistent with those shown in Tables 1-6:  ̂() is outperformed by at least one of the VLRSS-based CDF estimators, particularly when () → 0. Moreover,  ̂() performs consistently well relative to its RSS analog when  = 1 and the values of  are around the center of the parent distribution. Also, incorporating the information generated by the unmeasured items increases the efficiency of the suggested CDF estimators provided that  is close to the lower tail of the population distribution. Further, increasing both  1 and  sometimes provides sizeable efficiency gains for  ̂() and  ̂1 (). Additional detailed results for one special case are given in Fig. 3, which presents the population CDF,  ̂() and  ̂1 () based on the tree dataset with  = 5,  = 1,  1 = 10,  = 3 and  = 10. It is apparent that the lines of  ̂() and  ̂1 () are closer to the line of the true CDF when  is closer to the boundaries. Further, at small (large) values of , the performance of  ̂() ( ̂1 ()) becomes better. Finally, all the numerical results reported here were computed in R, and the code is available from the author upon request.

Fig. 3. The population CDF and the kernel-based CDF estimators for the tree dataset
→ stands for convergence in probability, Maximizing (s) with respect to  can be done as:

Three novel estimators for the CDF using VLRSS are suggested. It is theoretically shown that these estimators are consistent for the population CDF (). By incorporating the information generated from the unmeasured sampling items, a general class of CDF estimators is also constructed, which enables us to develop our proposed estimators. Additionally, the consistency of this class of CDF estimators is derived analytically. Based on small numerical experiments, we observe that the traditional RSS-based CDF estimator is outperformed by at least one of the VLRSS-based CDF estimators, particularly when  lies near at least one of the boundaries of the parent distribution, even if the ranking quality is poor. It is also evident that the kernel-based estimators are the best in almost all considered cases. A considerable efficiency gain is obtained by incorporating the information generated from the unmeasured sampling items, provided that  is near the lower tail of the parent distribution. Thus, we recommend using  ̂1 () when () → 0 and the ranking quality is good enough. Otherwise,  ̂() is the best choice. In a subsequent [...] CDF under VLRSS design is an important topic for future work. The author plans to address these points in the near future.

Table 1. The RE values of the CDF estimators using simulated normal distribution when

Table 2. The RE values of the CDF estimators using simulated normal distribution when