Automated Linking of Historical Data
Principal Investigator(s): View help for Principal Investigator(s) Ran Abramitzky; Leah Boustan; Katherine Eriksson; James Feigenbaum; Santiago Perez
Version: View help for Version V1
Published: View help for Published Date August 25, 2020
Citation
Project Citation:
Persistent URL: http://doi.org/10.3886/E120703V1
Project Description
- The ABE fully automated approach: This approach is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five year window and/or requiring the match on age to be exact.
- A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
Scope of Project
Methodology
Published Versions
Found a serious problem with the data, such as disclosure risk or copyrighted content? Let us know.
This material is distributed exactly as it arrived from the data depositor. ICPSR has not checked or processed this material. Users should consult the investigator(s) if further information is desired.