Sampling Plans and Estimates

  • Chapter
Statistics for Data Scientists

Part of the book series: Undergraduate Topics in Computer Science (UTICS)

Abstract

In the previous chapter we computed descriptive statistics for the dataset on faces. The results showed that the average rating was 58.37 and that men rated the faces higher than women on average. If we are only interested in the participants in the study and we are willing to believe that the results are fully deterministic, we could claim that the group of men rates higher than the group of women on average. However, if we believe that the ratings of one person for the same set of faces are not constant, or if we would like to know whether our statements also hold for a larger group of people (who did not participate in our experiment), we must understand what other results could have been observed if we had conducted the experiment at another time with the same group of participants or with another group of participants. To extend your conclusions beyond the observed data, which is more technically called statistical inference, you should ask where the dataset came from, how participants were collected, and how the results were obtained. For example, if the women who participated in the study of rating faces all came from one small village in the Netherlands, while the men came from many different villages and cities in the Netherlands, you would probably agree that the comparison between the average ratings of men and women becomes less meaningful. In this situation the dataset is considered selective towards women in the small village. Selective means here that the women in the study do not represent all women from the villages and cities included in the study; only a specific subgroup of women has been included. To overcome these types of issues, we need to know about the concepts of population, sample, sampling procedures, and estimation of population characteristics, and also how these concepts relate to each other, so that we can do proper statistical inference.

Notes

  1. Deterministic means here that the scores of the participants on the tested faces will always be the same: they are without any uncertainty. In reality, though, scoring or rating is subjective and variable even for just one person.

  2. Not constant means that a rating of 60 for one face from one of the participants could also have been 50 or 65 if we had asked for the rating at another moment.

  3. The theoretical value is a calculation procedure applied to all units in the population. It is considered theoretical, since the values on each unit in the population may not exist but we believe that they could exist. The faces dataset makes this clear, since we only know the ratings of faces for participants in the study, and values beyond the study only exist if faces were to be rated by other units as well.

  4. Note that other terminology for sample and observation unit is used in different fields of science or even within statistics. For instance, units are sometimes called elements and an observation unit may be referred to as an elementary unit or element. The sample unit is sometimes referred to as enumeration or listing unit (see Levy and Lemeshow 2013).

  5. Unfortunately, many studies on human beings rarely make their target population explicit, which makes research claims somewhat fuzzy.

  6. Any subset of units from a population is called a sample, irrespective of how the sample has been obtained. We use the term sampling to refer to the activity of obtaining the units in a sample. This includes taking the measurements on the units and not just the physical collection of units.

  7. We have to admit that Big Data is a bit of a buzzword that is not very well defined. For now we will just go with the fuzzy notion of a very large dataset: one consisting of many rows and columns.

  8. Homogeneity means “being all the same or all of the same kind”.

  9. Random is a frequently used word, but it is hard to define properly. In statistics we often use the term uniformly random for selection processes that are “governed by or involving equal chances for each unit”.

  10. The research field that is concerned with understanding health and disease.

  11. Experimental studies on human beings to determine if a new treatment is beneficial with respect to placebo or a currently available treatment.

  12. Simple random sampling is frequently combined with other choices or settings (see stratified and cluster sampling).

  13. Each unit in the population appears in 10 out of 20 possible samples \(S_{k}\). For instance, unit 3 appears in \(S_{1}\), \(S_{5}\), \(S_{6}\), \(S_{7}\), \(S_{11}\), \(S_{12}\), \(S_{13}\), \(S_{17}\), \(S_{18}\), and \(S_{19}\).

  14. Note that the argument FALSE supplied to the sample function indicates that there is no replacement. See also the next page on sampling without replacement.

  15. Sampling with replacement implies that the units are put back into the population after being sampled and can be sampled again.

  16. Note that we have provided a seed number (initial value) to make the procedure reproducible. Every time the procedure is run, you will obtain sample 15.

  17. The x! notation is called “x-factorial”.

  18. Note that it does not matter which n positions of a permutation of \(\{1,2,3,\ldots ,N\}\) you would choose. For our example there would be \(6!=720\) permutations of \(\{1,2,3,\ldots ,6\}\). If we consider the permutations for which the first three elements remain \(\{1,2,3\}\), there are 36 permutations (\(3!=6\) permutations of \(\{1,2,3\}\) and \(3!=6\) permutations of \(\{4,5,6\}\)) leading to the same sample \(\{1,2,3\}\). Thus there will be \(20=720/36\) different samples (given by \(S_{1}\), \(S_{2}, \ldots , S_{20}\)).

  19. Sampling without replacement means that the unit that is collected for the sample is not placed back in the population. This is common in medical science, marketing, psychology, etc. Sampling with replacement puts the unit back every time it is collected. For research on animals in the wild, units are of course placed back.

  20. Prevalence is the proportion of the population that is affected by the disease and incidence is the proportion or probability of occurrence within a certain time period.

  21. You will learn in the following chapter that this probability is equal to \(\frac{1000}{1010}\cdot \frac{999}{1009}\cdot \frac{998}{1008}\cdots \frac{901}{911}=\frac{910}{1010}\cdots \frac{901}{1001}=0.3508\).

  22. Note that these probabilities do not necessarily add up to one, i.e., \(p_{1}+p_{2}+\cdots +p_{N}\ne 1\), since we allow \(S_{k}\cap S_{l}\ne \emptyset \). Furthermore, the probabilities are not always the same for each unit.

  23. Vectors are mathematical entities. Here they are used to group the values \(x_{1}\), \(x_{2}, \ldots , x_{n}\) on variable x from sample \(S_{k}\). They are typically denoted in boldface, \(\boldsymbol{x}_k\), to distinguish vectors from the single observation \(x_k\). By definition, vectors are presented as columns; to present them as rows we write \(\boldsymbol{x}'_k\) and call this the transposed vector.

  24. The expectation operator \(\mathbb{E}()\) is introduced here merely as a shorthand notation, but when we discuss probability and random variables more theoretically in Chap. 4 we will see that the expectation has a more formal interpretation.

  25. For any estimator T we have the following convenient calculation rules: \(\mathbb{E}(cT)=c\,\mathbb{E}(T)\) and \(\mathrm{SE}(cT)=|c|\,\mathrm{SE}(T)\), where c is a constant.

  26. If the variable x is ordinal or nominal the theory in Sect. 2.6 cannot just be used, since ordinal and nominal values cannot be interpreted numerically (see Chap. 1). The binary variable (being either ordinal or nominal) is the exception as long as we code the two possible outcomes as 0 and 1.

  27. Note that it is not at all easy to produce “real” random numbers; this is true both when we use computers and when we shuffle cards. There is actually a scientific literature on how to shuffle cards such that they are really randomly ordered.

  28. An algorithm is a set of instructions designed to perform a specific task.

  29. So, why did we discuss it? Well, because it provides an easy start!

  30. The operation \(x\,\mathrm{mod}\,n\) gives the remainder when x is divided by n. For instance, \(26\,\mathrm{mod}\,7\) is equal to 5.
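
Several of the numerical claims in notes 13, 18, 21, and 30 can be checked with a few lines of code. The sketch below is only an illustration and is not taken from the chapter: it uses Python rather than the R sample function mentioned in note 14, all variable names are made up for this example, and it assumes that the samples \(S_{1}, \ldots , S_{20}\) are numbered in lexicographic order, an ordering that happens to reproduce the list given in note 13.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial

# Notes 13 and 18: all samples of size n = 3 from a population of N = 6 units.
N, n = 6, 3

# itertools.combinations yields the subsets in lexicographic order; numbering
# them S_1, ..., S_20 in that order is an assumption of this sketch.
samples = list(combinations(range(1, N + 1), n))

# Note 18: 6! orderings, of which 3! * 3! lead to the same sample.
count_from_permutations = factorial(N) // (factorial(n) * factorial(N - n))

# Note 13: indices k (1-based) of the samples S_k that contain unit 3.
samples_with_unit_3 = [k for k, s in enumerate(samples, start=1) if 3 in s]

# Note 13: number of samples in which each unit appears.
appearances = {u: sum(u in s for s in samples) for u in range(1, N + 1)}

# Note 21: the long product 1000/1010 * 999/1009 * ... * 901/911 (100 factors).
long_product = Fraction(1)
for i in range(100):
    long_product *= Fraction(1000 - i, 1010 - i)

# Note 21: the shortened product 910/1010 * ... * 901/1001 (10 factors).
short_product = Fraction(1)
for i in range(10):
    short_product *= Fraction(910 - i, 1010 - i)

# Note 30: remainder of 26 divided by 7.
remainder = 26 % 7

print(len(samples), count_from_permutations, comb(N, n))
print(samples_with_unit_3)
print(appearances)
print(float(long_product), long_product == short_product)
print(remainder)
```

Exact fractions are used for the products in note 21 so that the equality of the long and the shortened form can be tested without rounding error; only the final conversion to a decimal number is approximate.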

References

  • G. Casella, R.L. Berger, Statistical Inference, vol. 2 (Duxbury, Pacific Grove, CA, 2002)

  • W.G. Cochran, Sampling Techniques (Wiley, Hoboken, 2007)

  • W. Kruskal, F. Mosteller, Representative sampling, iv: the history of the concept in statistics, 1895–1939, in International Statistical Review/Revue Internationale de Statistique (1980), pp. 169–195

  • P.S. Levy, S. Lemeshow, Sampling of Populations: Methods and Applications (Wiley, Hoboken, 2013)

  • M. Matsumoto, T. Nishimura, Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. (TOMACS) 8(1), 3–30 (1998)

  • J. Neyman, On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Breakthroughs in Statistics (Springer, Berlin, 1992), pp. 123–150

  • D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (CRC Press, Boca Raton, 2003)

  • J.N. Towse, T. Loetscher, P. Brugger, Not all numbers are equal: preferences and biases among children and adults when generating random sequences. Front. Psychol. 5, 19 (2014)

  • F. Were, G. Orwa, R. Odhiambo, A design unbiased variance estimator of the systematic sample means. Am. J. Theor. Appl. Stat. 4(3), 201–210 (2015)

Author information

Corresponding author

Correspondence to Maurits Kaptein.

Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kaptein, M., van den Heuvel, E. (2022). Sampling Plans and Estimates. In: Statistics for Data Scientists. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-10531-0_2

  • DOI: https://doi.org/10.1007/978-3-030-10531-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-10530-3

  • Online ISBN: 978-3-030-10531-0

  • eBook Packages: Computer Science, Computer Science (R0)