Sampling and Sampling Frames in Big Data Epidemiology

Mooney, Stephen J.; Garber, Michael D.

doi:10.1007/s40471-019-0179-y

Genetic Epidemiology (C Amos, Section Editor)
Published: 02 February 2019

Volume 6, pages 14–22, (2019)

Current Epidemiology Reports

Stephen J. Mooney¹,² &
Michael D. Garber³

Abstract

Purpose of Review

The ‘big data’ revolution affords the opportunity to reuse administrative datasets for public health research. While such datasets offer dramatically increased statistical power compared with conventional primary data collection, typically at much lower cost, their use also raises substantial inferential challenges. In particular, it can be difficult to make population inferences because the sampling frames for many administrative datasets are undefined. We reviewed options for accounting for sampling in big data epidemiology.

Recent Findings

We identified three common strategies for accounting for sampling when the available data were not collected from a deliberately constructed sample: (1) explicitly reconstruct the sampling frame, (2) test the potential impacts of sampling using sensitivity analyses, and (3) limit inference to the sample.
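
As a rough illustration of strategies (1) and (2), and not the authors' own method, the sketch below reweights a convenience sample to an assumed target population (post-stratification) and then varies that assumption to see how far the estimate moves. The stratum labels, the population shares, and the function name `poststratified_mean` are all hypothetical, chosen only for this example.

```python
from collections import Counter


def poststratified_mean(records, population_shares):
    """Weighted mean of an outcome after post-stratification.

    records: list of (stratum, outcome) pairs from the available sample.
    population_shares: dict mapping stratum -> ASSUMED share of the target
        population (hypothetical values; in practice these would come from
        an external source such as a census).
    """
    n = len(records)
    sample_counts = Counter(stratum for stratum, _ in records)
    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, outcome in records:
        sample_share = sample_counts[stratum] / n
        # Weight = how under- or over-represented this stratum is in the sample.
        weight = population_shares[stratum] / sample_share
        weighted_sum += weight * outcome
        weight_total += weight
    return weighted_sum / weight_total


# Toy sample in which the "young" stratum is over-represented (3 of 4 records).
records = [("young", 1), ("young", 1), ("young", 0), ("old", 0)]

# Unweighted estimate: inference limited to the sample itself.
naive = sum(outcome for _, outcome in records) / len(records)

# Strategy (1): reweight to an assumed 50/50 target population.
adjusted = poststratified_mean(records, {"young": 0.5, "old": 0.5})

# Strategy (2): sensitivity analysis over the assumed share of "young".
sensitivity = {
    share: poststratified_mean(records, {"young": share, "old": 1 - share})
    for share in (0.3, 0.5, 0.7)
}
```

If the reweighted estimates differ substantially across plausible population shares, the unweighted result is sensitive to how the sample arose, which is the situation where limiting inference to the sample may be the more defensible choice.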

Summary

Inference from big data can be challenging because the impacts of sampling are unclear. Attention to sampling frames can minimize risks of bias.

Funding

This work was supported by grants from the National Library of Medicine (1K99LM012868) and the National Heart, Lung, and Blood Institute (F31HL143900). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Department of Epidemiology, University of Washington, 1959 NE Pacific Street, Health Sciences Bldg, F-262, Box 357236, Seattle, WA, 98195, USA
Stephen J. Mooney
Harborview Injury Prevention and Research Center, University of Washington, Seattle, WA, USA
Stephen J. Mooney
Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA, USA
Michael D. Garber

Corresponding author

Correspondence to Stephen J. Mooney.

Ethics declarations

Conflict of Interest

Stephen J. Mooney reports grants from the National Library of Medicine and the Better Bike Share Coalition during the conduct of the study. Michael D. Garber reports grants from the National Heart, Lung, and Blood Institute and from the American College of Sports Medicine during the conduct of the study.

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Genetic Epidemiology

About this article

Cite this article

Mooney, S.J., Garber, M.D. Sampling and Sampling Frames in Big Data Epidemiology. Curr Epidemiol Rep 6, 14–22 (2019). https://doi.org/10.1007/s40471-019-0179-y

Issue Date: 15 March 2019
