Abstract
The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While the first applications of the idea emerged around the turn of the century, the approach has really gained momentum over the last ten years, stimulated at least in part by recent developments in computer science. We consider the 30th anniversary of Rubin’s seminal paper on synthetic data (J. Off. Stat. 9 (1993) 462–468) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We also discuss the various strategies that have been suggested to measure the utility and the remaining disclosure risk of the generated data.
Funding Statement
This work was partially supported by the German Federal Institute for Drugs and Medical Devices and a grant from the German Federal Ministry of Education and Research (grant number 16KISA096) with funding from the European Union—NextGenerationEU.
Acknowledgments
The authors are grateful for helpful feedback on an earlier version of this paper from the FK2RG group at Mannheim University and LMU Munich. The authors also acknowledge very valuable feedback from three anonymous referees and an Associate Editor.
Citation
Jörg Drechsler, Anna-Carolina Haensch. "30 Years of Synthetic Data." Statist. Sci. 39 (2) 221–242, May 2024. https://doi.org/10.1214/24-STS927