December 2022
Bayesian data synthesis and the utility-risk trade-off for mixed epidemiological data
Joseph Feldman, Daniel R. Kowal
Ann. Appl. Stat. 16(4): 2577-2602 (December 2022). DOI: 10.1214/22-AOAS1604

Abstract

Much of the microdata used for epidemiological studies contain sensitive measurements on real individuals. As a result, such microdata cannot be published out of privacy concerns, and without public access to these data, any statistical analyses originally published on them are nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic high-dimensional microdatasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.

Funding Statement

Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under award number R01ES028819 and the Army Research Office (Kowal) under award number W911NF-20-1-0184.

Acknowledgments

The authors would like to thank the reviewers for their constructive comments that have greatly improved the paper. In addition, the authors thank Marie Lynn Miranda and Katherine B. Ensor for their valuable insights and feedback.

The content, views, and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Institutes of Health, the North Carolina Department of Health and Human Services, Division of Public Health, the Army Research Office, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation herein.

Citation


Joseph Feldman, Daniel R. Kowal. "Bayesian data synthesis and the utility-risk trade-off for mixed epidemiological data." Ann. Appl. Stat. 16(4): 2577-2602, December 2022. https://doi.org/10.1214/22-AOAS1604

Information

Received: 1 February 2021; Revised: 1 January 2022; Published: December 2022
First available in Project Euclid: 26 September 2022

MathSciNet: MR4489224
zbMATH: 1498.62211
Digital Object Identifier: 10.1214/22-AOAS1604

Keywords: copula, data privacy, factor model, nonparametric regression

Rights: Copyright © 2022 Institute of Mathematical Statistics

JOURNAL ARTICLE
26 PAGES
