Sampling and Sampling Frames in Big Data Epidemiology

  • Genetic Epidemiology (C Amos, Section Editor)
Current Epidemiology Reports

Abstract

Purpose of Review

The ‘big data’ revolution affords the opportunity to reuse administrative datasets for public health research. While such datasets offer dramatically increased statistical power compared with conventional primary data collection, typically at much lower cost, their use also raises substantial inferential challenges. In particular, it can be difficult to make population inferences because the sampling frames for many administrative datasets are undefined. We reviewed options for accounting for sampling in big data epidemiology.

Recent Findings

We identified three common strategies for accounting for sampling when the available data were not collected from a deliberately constructed sample: (1) explicitly reconstruct the sampling frame, (2) test the potential impacts of sampling using sensitivity analyses, and (3) limit inference to the sample.
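
As a rough illustration of strategy (1), and not a description of the authors' own method: if a sampling frame can be reconstructed well enough to assign each record a probability of appearing in the dataset, an inverse-probability-weighted (Hájek-type) estimate of a population mean can be computed. The function name, the records, and the inclusion probabilities below are all hypothetical and chosen only for illustration.

```python
def weighted_mean(values, inclusion_probs):
    """Hajek-style estimate of a population mean.

    Each record is weighted by 1 / (its assumed probability of
    appearing in the dataset), so under-sampled groups count more.
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical records: outcome (1 = event) and an ASSUMED inclusion
# probability for each record, e.g. from a reconstructed sampling frame.
outcomes = [1, 1, 0, 0]
probs = [0.5, 0.5, 0.25, 0.25]

naive = sum(outcomes) / len(outcomes)      # ignores sampling
adjusted = weighted_mean(outcomes, probs)  # reweights to the frame
```

In this toy example the unweighted prevalence is 0.5, while the weighted estimate is about 0.33, because records without the event were assumed to be half as likely to appear in the data. The estimate is only as good as the assumed inclusion probabilities.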

Summary

Inference from big data can be challenging because the impacts of sampling are unclear. Attention to sampling frames can minimize risks of bias.

Funding

This work was supported by grants from the National Library of Medicine (1K99LM012868) and the National Heart, Lung, and Blood Institute (F31HL143900). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Corresponding author

Correspondence to Stephen J. Mooney.

Ethics declarations

Conflict of Interest

Stephen J. Mooney reports grants from the National Library of Medicine and the Better Bike Share Coalition during the conduct of the study. Michael D. Garber reports grants from the National Heart, Lung, and Blood Institute and the American College of Sports Medicine during the conduct of the study.

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Genetic Epidemiology

About this article

Cite this article

Mooney, S.J., Garber, M.D. Sampling and Sampling Frames in Big Data Epidemiology. Curr Epidemiol Rep 6, 14–22 (2019). https://doi.org/10.1007/s40471-019-0179-y

  • DOI: https://doi.org/10.1007/s40471-019-0179-y
