Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

Published online by Cambridge University Press:  02 January 2019

TED ENAMORADO* (Princeton University)
BENJAMIN FIFIELD* (Princeton University)
KOSUKE IMAI* (Harvard University)

*Ted Enamorado, Ph.D. Candidate, Department of Politics, Princeton University, tede@princeton.edu, http://www.tedenamorado.com.
Benjamin Fifield, Ph.D. Candidate, Department of Politics, Princeton University, bfifield@princeton.edu, http://www.benfifield.com.
Kosuke Imai, Professor, Department of Government and Department of Statistics, Harvard University, imai@harvard.edu, https://imai.fas.harvard.edu.

Abstract

Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers’ workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are severe especially when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
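
As a rough illustration of the kind of probabilistic record linkage the abstract refers to, the sketch below computes the posterior probability that a pair of records is a match from field-by-field agreement indicators, assuming the fields are conditionally independent given match status and that missing fields can simply be skipped. The function name, the parameter names (`m`, `u`, `prior`), and the numeric values are hypothetical and chosen only for illustration; this is not the authors' implementation and not the fastLink API, which also estimates these parameters from the data rather than taking them as given.

```python
def match_probability(agreements, m, u, prior):
    """Posterior probability that a record pair is a match (illustrative sketch).

    agreements: one entry per field; True = values agree, False = disagree,
                None = at least one of the two values is missing.
    m[k]:  assumed P(field k agrees | pair is a true match)
    u[k]:  assumed P(field k agrees | pair is a non-match)
    prior: assumed P(pair is a match) before comparing any field
    """
    like_match = 1.0
    like_nonmatch = 1.0
    for agree, m_k, u_k in zip(agreements, m, u):
        if agree is None:
            # Assumption: a missing comparison carries no information,
            # so it drops out of both likelihoods.
            continue
        like_match *= m_k if agree else 1.0 - m_k
        like_nonmatch *= u_k if agree else 1.0 - u_k
    numerator = prior * like_match
    return numerator / (numerator + (1.0 - prior) * like_nonmatch)


# Hypothetical parameters for three fields: first name, last name, birth year.
m = [0.95, 0.95, 0.90]
u = [0.05, 0.02, 0.10]

# Names agree and birth year is missing in one of the two files.
p = match_probability([True, True, None], m, u, prior=0.01)
print(round(p, 3))
```

A deterministic merge would treat the pair above as either an exact match or a failure; the probabilistic version instead returns a number that can be thresholded or carried into later analyses as a weight.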

Type
Research Article
Copyright
Copyright © American Political Science Association 2019 

Footnotes

The proposed methodology is implemented through an open-source R package, fastLink: Fast Probabilistic Record Linkage, which is freely available for download at the Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=fastLink). We thank Bruce Willsie of L2 and Steffen Weiss of YouGov for data and technical assistance. We also thank Jake Bowers, Seth Hill, Johan Lim, Marc Ratkovic, Mauricio Sadinle, five anonymous reviewers, and audiences at the 2017 Annual Meeting of the American Political Science Association, Columbia University (Political Science), the Fifth Asian Political Methodology Meeting, Gakusyuin University (Law), Hong Kong University of Science and Technology, the Institute for Quantitative Social Science (IQSS) at Harvard University, the Quantitative Social Science (QSS) colloquium at Princeton University, Universidad de Chile (Economics), Universidad del Desarrollo, Chile (Government), the 2017 Summer Meeting of the Society for Political Methodology, and the Center for Statistics and the Social Sciences (CSSS) at the University of Washington for useful comments and suggestions. Replication materials are available on Dataverse at https://doi.org/10.7910/DVN/YGUHTD.
