Published November 10, 2023 | Version v1
Dataset Open

Data for "Modular Bibliographical Profiling of Historic Book Reviews"

  • 1. Denison University

Contributors

  • 1. Denison University
  • 2. University of Pittsburgh

Description

This dataset supports the research paper "Modular Bibliographical Profiling of Historic Book Reviews." The paper examines different methods of predicting bibliographical details (e.g., author, title, and publisher) of books under review in a corpus of approximately 1,100 historical book reviews. The dataset consists of book reviews from ProQuest's American Periodicals Series (APS). This kind of bibliographical profiling is often characterized as a Natural Language Processing (NLP) or Named Entity Recognition (NER) task, but it can be described more specifically as a two-part Named Entity Linking (NEL) task: a feature extraction stage followed by one of several available matching or classification methods. The paper attempts to formalize constraints for modular bibliographical profiling (MBP) and to shed light on some important choices that digital humanities (DH) practitioners often gloss over or obscure. Applying these constraints, the paper evaluates combinations of feature selection (naive bag-of-words [BOW], rule-based feature extraction, and NER using a pre-trained model) with a standardized similarity-based matching strategy (cosine similarity). All tasks are performed on derived text data (term frequency tables), so that the data can be shared and all methods can be used on materials available only in non-consumptive formats. These comparisons suggest that naive BOW can perform quite robustly, and that using even a basic pre-trained NER model in conjunction with a BOW approach may reduce false positives.
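
As an illustration of the similarity-based matching described above, the following is a minimal sketch of cosine similarity computed directly on term frequency tables. It is not the paper's implementation: the function names (cosine_similarity, best_match), the record identifiers, and the toy term counts are all hypothetical, and the actual layout of the files in MBP-data.zip may differ.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency tables (term -> count)."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty table matches nothing
    return dot / (norm_a * norm_b)


def best_match(review_tf, candidates):
    """Return (candidate_id, score) for the candidate most similar to the review."""
    scored = ((cid, cosine_similarity(review_tf, tf)) for cid, tf in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: one review's term counts and two candidate
# bibliographical records (author, title, and publisher terms).
review_tf = Counter({"twain": 3, "huckleberry": 2, "finn": 2, "webster": 1})
candidates = {
    "record_a": Counter({"twain": 1, "huckleberry": 1, "finn": 1, "webster": 1}),
    "record_b": Counter({"alcott": 1, "little": 1, "women": 1, "roberts": 1}),
}

match_id, score = best_match(review_tf, candidates)
print(match_id, round(score, 3))
```

Because the computation needs only term counts, a sketch like this can run on derived, non-consumptive data without access to the full review text. Restricting the terms in each table (for example, to tokens a NER model tags as entities) would change only how the tables are built, not the matching step.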

Files

MBP-data.zip (28.9 MB)
md5:91da89453d2dab77ec88927a6c0f19f6

Additional details

Dates

Created
2019-08-01/2023-11-01