Citation

Project Citation: 

Abramitzky, Ran, Boustan, Leah, Eriksson, Katherine, Feigenbaum, James, and Perez, Santiago. Automated Linking of Historical Data. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-08-25. https://doi.org/10.3886/E120703V1

Persistent URL:  http://doi.org/10.3886/E120703V1

Project Description

Project Title:  Automated Linking of Historical Data
Summary:  Currently, the repository provides code for two automated linking methods:
  1. The ABE fully automated approach: This approach is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.
  2. A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) is a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each pair of potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
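To make the first approach more concrete, the following is a minimal Python sketch of linking by first name, last name and age. It is not the depositors' code. The record layout (dicts with `id`, `first`, `last`, `age`), the function `link_records`, and its parameters `max_age_gap` and `unique_window` are all hypothetical names chosen for illustration. The sketch compares raw names only (no NYSIIS or Jaro-Winkler), and the choice to try an exact age match first and then widen the age gap one year at a time is an assumption of this sketch rather than something the summary specifies.

```python
from collections import Counter, defaultdict


def name_key(rec):
    """Raw-name key: trimmed, lower-cased first and last name (no phonetic coding)."""
    return (rec["first"].strip().lower(), rec["last"].strip().lower())


def link_records(earlier, later, years_between, max_age_gap=2, unique_window=None):
    """Link records across two censuses taken `years_between` years apart.

    max_age_gap=0 requires the match on age to be exact.
    unique_window=2 additionally requires the name to be unique among later
    records within +/- 2 years of the expected age (a five-year window).
    Returns a dict mapping earlier-record id -> later-record id.
    """
    by_name = defaultdict(list)
    for rec in later:
        by_name[name_key(rec)].append(rec)

    links = {}
    for rec in earlier:
        target_age = rec["age"] + years_between
        candidates = by_name.get(name_key(rec), [])

        if unique_window is not None:
            nearby = [c for c in candidates
                      if abs(c["age"] - target_age) <= unique_window]
            if len(nearby) != 1:
                continue  # name is not unique in the window: leave unlinked

        # Try an exact age match first, then widen the allowed gap (assumption).
        for gap in range(max_age_gap + 1):
            hits = [c for c in candidates if abs(c["age"] - target_age) <= gap]
            if hits:
                if len(hits) == 1:  # accept only an unambiguous candidate
                    links[rec["id"]] = hits[0]["id"]
                break  # ambiguous at this gap: do not search further

    # Drop later records claimed by more than one earlier record.
    claimed = Counter(links.values())
    return {a: b for a, b in links.items() if claimed[b] == 1}


# Made-up toy records, ten years apart.
EARLIER = [
    {"id": "a1", "first": "John", "last": "Smith", "age": 20},
    {"id": "a2", "first": "Mary", "last": "Jones", "age": 30},
    {"id": "a3", "first": "Will", "last": "Brown", "age": 40},
]
LATER = [
    {"id": "b1", "first": "John", "last": "Smith", "age": 30},
    {"id": "b2", "first": "John", "last": "Smith", "age": 31},
    {"id": "b3", "first": "Mary", "last": "Jones", "age": 41},
    {"id": "b4", "first": "Will", "last": "Brown", "age": 50},
    {"id": "b5", "first": "Will", "last": "Brown", "age": 50},
]

print(link_records(EARLIER, LATER, 10))
print(link_records(EARLIER, LATER, 10, max_age_gap=0))
print(link_records(EARLIER, LATER, 10, unique_window=2))
```

On these toy records, the default settings link a1 to b1 and a2 to b3. Requiring an exact age match keeps only a1 to b1. Requiring uniqueness within the five-year window keeps only a2 to b3, because two John Smiths fall inside the window. Will Brown is never linked, since two later records match him equally well.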
Original Distribution URL:  https://ranabr.people.stanford.edu/matching-codes

Scope of Project

Subject Terms:  census data; historical; linking; record linkage
Geographic Coverage:  United States
Time Period(s):  1850-1940
Data Type(s):  census/enumeration data

Methodology

Data Source:  Complete-count US Census data (1850-1940)

This material is distributed exactly as it arrived from the data depositor. ICPSR has not checked or processed this material. Users should consult the investigator(s) if further information is desired.