Abstract
Purpose of Review
The ‘big data’ revolution affords the opportunity to reuse administrative datasets for public health research. While such datasets offer dramatically increased statistical power compared with conventional primary data collection, typically at much lower cost, their use also raises substantial inferential challenges. In particular, it can be difficult to make population inferences because the sampling frames for many administrative datasets are undefined. We reviewed options for accounting for sampling in big data epidemiology.
Recent Findings
We identified three common strategies for accounting for sampling when the available data were not collected from a deliberately constructed sample: (1) explicitly reconstruct the sampling frame, (2) test the potential impacts of sampling using sensitivity analyses, and (3) limit inference to the sample.
Summary
Inference from big data can be challenging because the impacts of sampling are unclear. Attention to sampling frames can minimize risks of bias.
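As a rough illustration of strategies (1) and (2), the sketch below reweights a convenience sample to an assumed sampling frame (post-stratification) and then varies that assumption as a simple sensitivity analysis. Everything in it is hypothetical: the strata names, the counts, the population shares, and the function names are invented for illustration and do not come from the article or from any real dataset.

```python
# Toy illustration only: every number here is invented, not taken from any study.

def naive_mean(sample):
    """Unweighted mean of the outcome, ignoring how the sample arose."""
    return sum(value for _, value in sample) / len(sample)


def poststratified_mean(sample, population_shares):
    """Weight each stratum's sample mean by its assumed population share."""
    totals, counts = {}, {}
    for stratum, value in sample:
        totals[stratum] = totals.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    missing = set(population_shares) - set(counts)
    if missing:
        # A stratum with no sampled members cannot be reweighted.
        raise ValueError(f"no sample members in strata: {sorted(missing)}")
    return sum(
        share * totals[stratum] / counts[stratum]
        for stratum, share in population_shares.items()
    )


# Hypothetical convenience sample of (age group, outcome indicator) pairs
# in which younger people are over-represented (80 of 100 records).
sample = (
    [("young", 1)] * 30 + [("young", 0)] * 50
    + [("old", 1)] * 12 + [("old", 0)] * 8
)

# Strategy (1): an assumed, reconstructed sampling frame (half young, half old).
population_shares = {"young": 0.5, "old": 0.5}
naive = naive_mean(sample)
adjusted = poststratified_mean(sample, population_shares)

# Strategy (2): vary the assumed share of the "old" stratum and record
# how far the adjusted estimate moves.
sensitivity = {
    old_share: poststratified_mean(
        sample, {"young": 1 - old_share, "old": old_share}
    )
    for old_share in (0.3, 0.5, 0.7)
}

print(f"naive={naive:.4f} adjusted={adjusted:.4f}")
for old_share, estimate in sensitivity.items():
    print(f"assumed old share={old_share:.1f} -> estimate={estimate:.4f}")
```

If the estimate changes little across plausible frames, the sampling process is less likely to drive the conclusion. If it changes a lot, limiting inference to the sample, strategy (3), may be the safer choice.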
Funding
This work was supported by grants from the National Library of Medicine (1K99LM012868) and the National Heart, Lung, and Blood Institute (F31HL143900). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Ethics declarations
Conflict of Interest
Stephen J. Mooney reports grants from the National Library of Medicine and the Better Bike Share Coalition during the conduct of the study. Michael D. Garber reports grants from the National Heart, Lung, and Blood Institute and the American College of Sports Medicine during the conduct of the study.
Human and Animal Rights and Informed Consent
This article does not contain any studies with human or animal subjects performed by any of the authors.
Additional information
This article is part of the Topical Collection on Genetic Epidemiology
Cite this article
Mooney, S.J., Garber, M.D. Sampling and Sampling Frames in Big Data Epidemiology. Curr Epidemiol Rep 6, 14–22 (2019). https://doi.org/10.1007/s40471-019-0179-y
DOI: https://doi.org/10.1007/s40471-019-0179-y