An empirical assessment of ranking accuracy in ranked set sampling

https://doi.org/10.1016/j.csda.2006.07.018

Abstract

Ranked set sampling (RSS) involves ranking of potential sampling units on the variable of interest using judgment or an auxiliary variable to aid in sample selection. Its effectiveness depends on the success in this ranking. We provide an empirical assessment of RSS ranking accuracy in estimation of a population proportion.

Introduction

Ranked set sampling (RSS) employs ranking of the characteristic of interest prior to sampling via auxiliary information to improve estimation of a population parameter. It was first proposed by McIntyre (1952) for estimating mean pasture yield. Generally, the most promising setting for RSS is where sample units can be readily ranked (either through subjective judgment or via the use of a concomitant variable) but where the exact measurement of sample units is costly in time or effort. Patil (1995) provided more details about appropriate settings for the use of RSS.

A general RSS procedure can be carried out as follows. For $r = 1, \ldots, m$, select at random $n_r > 0$ sets of $m$ units each from the population of interest ($m$ is referred to as the set size) and rank the units within each set on the variable of interest, say $X$, without making actual measurements on $X$. The rankings can be the result of judgment ranking or obtained via a concomitant variable or several concomitant variables (regression approach). Then the $r$th smallest ranked item in each of the $n_r$ sets is quantified for $r = 1, \ldots, m$. For each $r$, the remaining $n_r m - n_r$ units are used solely for the purpose of the ranking and are not quantified. We note that in the RSS protocol the unquantified units provide additional information about the characteristic of interest through the ranking process and therefore lead to improved estimation of population parameters.
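The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `ranked_set_sample`, the representation of a unit as an `(x, y)` pair, and the toy population are all assumptions made for the example. The ranking uses only `rank_key` (here a concomitant value `y`), and `measure` is called only on the one unit per set that is quantified.

```python
import random

def ranked_set_sample(population, n_per_rank, rank_key, measure, rng):
    """Draw a ranked set sample.

    n_per_rank[r-1] is n_r, the number of sets drawn for rank r; the set
    size m is len(n_per_rank). Returns a dict mapping each rank r to the
    list of quantified values for that rank.
    """
    m = len(n_per_rank)
    sample = {r: [] for r in range(1, m + 1)}
    for r in range(1, m + 1):
        for _ in range(n_per_rank[r - 1]):
            units = rng.sample(population, m)         # one set of m units
            ranked = sorted(units, key=rank_key)      # rank without measuring X
            sample[r].append(measure(ranked[r - 1]))  # quantify r-th ranked unit
    return sample

# Hypothetical population: x is the variable of interest, y a noisy concomitant.
rng = random.Random(2006)
population = []
for _ in range(500):
    x = rng.gauss(0.0, 1.0)
    population.append((x, x + rng.gauss(0.0, 0.5)))

# Unbalanced allocation n_1 = 2, n_2 = 1, n_3 = 3 with set size m = 3.
rss = ranked_set_sample(population, [2, 1, 3],
                        rank_key=lambda u: u[1],
                        measure=lambda u: u[0],
                        rng=rng)
```

Passing equal entries in `n_per_rank` gives a balanced design; unequal entries give an unbalanced one.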

When rankings are not perfect, we obtain judgment order statistics as opposed to the true order statistics obtained under perfect rankings. Thus, the notations [ ] and ( ) are used in the subscripts to represent possibly erroneous and perfect rankings, respectively. Let $X_{[r]}$ represent the collection of $r$th judgment order statistics and let $X_{[r]j}$ denote the $j$th observation from that collection. This sampling scheme yields an RSS of size $n$, where $\sum_{r=1}^{m} n_r = n$. We note that $n_r$ is the sample size for rank $r$. In the situation where $n_1 = n_2 = \cdots = n_m$, the procedure leads to a balanced RSS; otherwise, it leads to an unbalanced RSS.

In the case where the population parameter of interest is the mean of $X$, the RSS estimator of the population mean $\mu$ is defined as

$$\hat{\mu}_{RSS} = \frac{1}{m} \sum_{r=1}^{m} \frac{1}{n_r} \sum_{j=1}^{n_r} X_{[r]j}.$$

Theoretical results have shown that the RSS estimator of a population mean is unbiased regardless of ranking errors (Dell and Clutter, 1972). That is,

$$E(\hat{\mu}_{RSS}) = \mu.$$
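As a small worked illustration of the estimator, the sketch below averages the quantified values within each rank and then averages the $m$ rank means. The function name `rss_mean` and the toy numbers are assumptions made for the example.

```python
def rss_mean(sample):
    """RSS estimate of the population mean.

    sample maps each rank r = 1, ..., m to the list of quantified values
    for that rank. Each rank mean gets weight 1/m, whatever its n_r is.
    """
    m = len(sample)
    return sum(sum(vals) / len(vals) for vals in sample.values()) / m

# Unbalanced toy sample with m = 3 and n_1 = 2, n_2 = 1, n_3 = 3.
toy = {1: [1.0, 3.0], 2: [4.0], 3: [6.0, 8.0, 10.0]}
estimate = rss_mean(toy)  # rank means are 2, 4 and 8, so the estimate is 14/3
```

In an unbalanced design this differs from the plain average of all quantified values, because each rank mean receives equal weight.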

The ranking process is an important aspect of the RSS procedure. The precision of RSS relative to simple random sampling (SRS) with the same number of quantifications, defined as the ratio of the two variances for the corresponding estimators, depends to a great extent on the success in ranking. In general, for a fixed sample allocation of $n_r$, the more accurate the ranking within each set, the more precise the RSS procedure will be. In the case of using a single concomitant variable for ranking, an extensive literature has shown that the amount of increase in the precision of the RSS estimator is dependent on the association between the auxiliary variable and the continuous variable of interest (see, for example, Stokes, 1977; Bohn, 1996; Husby et al., 2005). Furthermore, it does not matter whether the association is positive or negative.
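The dependence of relative precision on the strength, but not the sign, of the association can be checked with a small Monte Carlo sketch. Everything here is an assumption made for illustration: a balanced design, a bivariate normal pair with correlation `rho`, ranking on the concomitant only, and the ratio taken as the SRS variance over the RSS variance, so that values above 1 favor RSS. The function name `relative_precision` is hypothetical.

```python
import math
import random
import statistics

def relative_precision(rho, m=3, cycles=1, reps=4000, seed=0):
    """Monte Carlo estimate of Var(SRS mean) / Var(RSS mean)."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(1.0 - rho * rho)
    rss_means, srs_means = [], []
    for _ in range(reps):
        rank_means = []
        for r in range(m):
            vals = []
            for _ in range(cycles):
                units = []
                for _ in range(m):
                    x = rng.gauss(0.0, 1.0)
                    y = rho * x + noise_sd * rng.gauss(0.0, 1.0)
                    units.append((y, x))
                units.sort()              # rank on the concomitant y only
                vals.append(units[r][1])  # quantify x for the r-th ranked unit
            rank_means.append(sum(vals) / len(vals))
        rss_means.append(sum(rank_means) / m)
        # SRS with the same number of quantifications, m * cycles
        srs = [rng.gauss(0.0, 1.0) for _ in range(m * cycles)]
        srs_means.append(sum(srs) / len(srs))
    return statistics.variance(srs_means) / statistics.variance(rss_means)

rp_positive = relative_precision(0.9)
rp_negative = relative_precision(-0.9)
rp_none = relative_precision(0.0)
```

Under these assumptions, `rp_positive` and `rp_negative` should both come out clearly above 1 and close to each other, while `rp_none` should be close to 1, since ranking on an unrelated variable amounts to random ranking.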

We note that an assessment of the accuracy in ranking has not been addressed previously in the RSS literature. In this paper, we present an empirical comparison of judgment rankings versus the true ranks of the observations in the context of RSS for binary variables since the application of RSS for estimating population proportions has not been studied thoroughly (Lacayo et al., 2002; Chen, 2006).

When the variable of interest is binary, the rankings can be accomplished through a concomitant variable (Terpstra and Liudahl, 2004) or, more generally, a logistic regression model that could be previously developed or fit from an auxiliary data set (Chen et al., 2005). Specifically, let $\pi_i$ denote the probability of success for an individual $i$ in a set of size $m$ from a Bernoulli($p$) distribution. The estimated probability of success for this individual, $\hat{\pi}_i$, can be obtained from a logistic regression model fit from a training sample. Then, the $m$ sample units can be ranked according to their estimated probabilities of success, $\hat{\pi}_1, \ldots, \hat{\pi}_m$. In this study, we use a substantial data set, the third National Health and Nutrition Examination Survey (NHANES III), to evaluate empirically the accuracy in rankings obtained in this manner.
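The ranking step can be sketched as follows. This is a hedged illustration only: the model is fit by plain gradient ascent on the mean log-likelihood rather than by whatever software the study used, the tiny training sample is made up, and the names `fit_logistic`, `predicted_prob` and `rank_set` are hypothetical.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predicted_prob(beta, x):
    """Estimated success probability; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(features, outcomes, lr=0.1, iters=2000):
    """Fit a logistic regression by gradient ascent (illustrative only)."""
    n, k = len(features), len(features[0])
    beta = [0.0] * (k + 1)
    for _ in range(iters):
        grad = [0.0] * (k + 1)
        for x, y in zip(features, outcomes):
            resid = y - predicted_prob(beta, x)
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def rank_set(beta, units):
    """Indices of the units ordered from smallest to largest estimated probability."""
    probs = [predicted_prob(beta, x) for x in units]
    return sorted(range(len(units)), key=lambda i: probs[i])

# Made-up training sample with one covariate and a binary outcome.
train_x = [[-2.0], [-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
train_y = [0, 0, 0, 1, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(train_x, train_y)

# Rank a new set of m = 3 units without observing their outcomes.
order = rank_set(beta_hat, [[1.0], [-2.0], [0.2]])
```

With a single covariate, ranking on the estimated probability is equivalent to ranking on the covariate itself (reversed if the slope is negative); the model-based ranking differs only when several covariates are combined.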

In the following section, we describe our criteria for evaluating the accuracy in rankings. Section 3 describes the data set and the variables used in the simulation study. Section 4 presents the results of our study, and in Section 5 we discuss conclusions.

Evaluation of accuracy in ranking

The accuracy of ranking can be assessed through an $m \times m$ matrix $P = (p_{ij})$, where the element $p_{ij}$ denotes the probability that the item with actual rank $i$ in a given set of size $m$ is designated to be the $j$th judgment order statistic (Bohn and Wolfe, 1994; Stark and Wolfe, 2002). That is, $p_{ij} = P(X_{(i)} = X_{[j]})$. For perfect ranking, $p_{ii} = 1$ for $i = 1, \ldots, m$ and $p_{ij} = 0$ for $i \neq j = 1, \ldots, m$. If the ranking process is completely random, then $p_{ij} = 1/m$ for all $i$ and $j$. For each $j = 1, \ldots, m$, $\sum_{i=1}^{m} p_{ij} = 1$ since the jth judgment order
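One way to estimate such a matrix by simulation is sketched below, assuming each entry is estimated by the fraction of simulated sets in which the unit with true rank $i$ receives judgment rank $j$. The function `estimate_P` and the two set generators are hypothetical names introduced for the example, not the study's code.

```python
import random

def estimate_P(draw_set, m, G, rng):
    """Estimate the m x m ranking-accuracy matrix from G simulated sets.

    draw_set(m, rng) must return m pairs (x, score): x is the true value
    and score is the quantity used for judgment ranking.
    """
    counts = [[0] * m for _ in range(m)]
    for _ in range(G):
        units = draw_set(m, rng)
        true_order = sorted(range(m), key=lambda k: units[k][0])
        judged_order = sorted(range(m), key=lambda k: units[k][1])
        judged_rank = {k: j for j, k in enumerate(judged_order)}
        for i, k in enumerate(true_order):
            counts[i][judged_rank[k]] += 1  # true rank i judged as rank j
    return [[c / G for c in row] for row in counts]

def perfect_set(m, rng):
    # ranking variable equals the true value
    return [(x, x) for x in (rng.random() for _ in range(m))]

def noisy_set(m, rng):
    # ranking variable is the true value plus independent noise
    return [(x, x + rng.gauss(0.0, 1.0))
            for x in (rng.gauss(0.0, 1.0) for _ in range(m))]

rng = random.Random(0)
P_perfect = estimate_P(perfect_set, 4, 500, rng)
P_noisy = estimate_P(noisy_set, 4, 500, rng)
```

Here `P_perfect` is the identity matrix, while `P_noisy` spreads probability away from the diagonal; in both cases every row and every column sums to 1.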

Description of data set and variables

The Third National Health and Nutrition Examination Survey (NHANES III, 1988–1994) was conducted by the National Center for Health Statistics, Centers for Disease Control and Prevention. This survey was designed to obtain nationally representative information on the health and nutritional status of the population of the United States (for more details, see the National Center for Health Statistics website, 2002). The data set contains information for 33,994 persons age 2 months and older who

Accuracy of rankings

We use the $\hat{p}_{ij}$'s, as defined in Section 2, to assess the accuracy of ranking in both the single concomitant variable setting and the logistic regression setting. We chose $G = 10{,}000$ iterations in each setting. Set sizes of 4 and 10 were used to investigate the effect of set sizes on accuracy in rankings. In all the tables, ‘true rank’ is the rank based on the numerical values of BMI and ‘assigned rank’ is the rank assigned by the ranking process using various single concomitants or logistic

Conclusions

In this study, we focus on an empirical evaluation of the accuracy in RSS rankings. Through the P matrix, defined in Section 2, we are able to demonstrate and compare the patterns of accuracy in rankings for various ranking processes. Our results provide numerical evidence for the common belief that the accuracy in rankings increases as the magnitude of correlation between a ranking variable and the variable of interest increases. Moreover, our study clearly demonstrates the benefits of using

Acknowledgments

This work was supported in part by National Science Foundation Grant number DMS-9802358. We thank the US Department of Health and Human Services, National Center for Health Statistics for allowing us to use the NHANES III (1988–1994) data set. We also thank the referees for their helpful suggestions that led to a better presentation of our results in this paper.

References

  • Bohn, L.L., 1996. A review of nonparametric ranked set sampling methodology. Commun. Statist. Theory Methods.
  • Bohn, L.L., et al., 1994. The effect of imperfect rankings on properties of procedures based on the ranked set samples analog of Mann–Whitney–Wilcoxon statistic. J. Amer. Statist. Assoc.
  • Chen, H., 2006. Alternative ranked set sampling estimators for the variance of a sample proportion. J. Appl. Statist....
  • Chen, H., et al., 2005. Ranked set sampling for efficient estimation of a population proportion. Statist. Med.
  • Dell, T.R., et al., 1972. Ranked set sampling theory with order statistics background. Biometrics.
  • Hosmer, D.W., et al., 2000. Applied Logistic Regression.
  • Husby, C.E., Stasny, E.A., Wolfe, D.A., 2005. An application of ranked set sampling for mean and median estimation...
