Data for "Modular Bibliographical Profiling of Historic Book Reviews"
Contributors
Data curators:
Description
This dataset supports the research paper "Modular Bibliographical Profiling of Historic Book Reviews." The paper examines different methods of predicting bibliographical details (e.g., author, title, and publisher) of books under review in a corpus of approximately 1,100 historical book reviews. The dataset consists of book reviews from ProQuest's American Periodicals Series (APS). This kind of bibliographical profiling is often characterized as a Natural Language Processing (NLP) or Named Entity Recognition (NER) task, but it can be described more specifically as a two-part Named Entity Linking (NEL) task: a feature extraction stage, followed by one of several available matching or classification methods. The paper attempts to formalize constraints for modular bibliographical profiling (MBP) and to shed light on some important choices that digital humanities (DH) practitioners often gloss over or obscure. Applying these constraints, the paper evaluates combinations of feature selection methods (naive bag-of-words [BOW], rule-based feature extraction, and NER using a pre-trained model) with a standardized similarity-based matching strategy (cosine similarity). All tasks are performed on derived text data (term frequency tables), so that the data can be shared and all methods can be used on materials available only in non-consumptive formats. These comparisons suggest that naive BOW can perform quite robustly, and that using even a basic pre-trained NER model in conjunction with a BOW approach may reduce false positives.
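As a rough illustration of the similarity-based matching described above, the following minimal sketch scores a review's term frequency table against candidate bibliographic records using cosine similarity. It is not the paper's implementation: the function names (`term_frequencies`, `cosine_similarity`, `best_match`), the whitespace tokenization, and the toy tokens are all assumptions made for illustration, and none of the example data comes from the dataset.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Build a bag-of-words term frequency table (hypothetical helper)."""
    return Counter(token.lower() for token in tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two term frequency tables."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty table matches nothing
    return dot / (norm_a * norm_b)


def best_match(review_tf, candidates):
    """Return (key, score) for the candidate most similar to the review."""
    best_key, best_score = None, 0.0
    for key, candidate_tf in candidates.items():
        score = cosine_similarity(review_tf, candidate_tf)
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


# Toy data, invented for this sketch (not drawn from the dataset).
review = term_frequencies(
    "a review of moby dick by herman melville published by harper".split()
)
candidates = {
    "moby-dick": term_frequencies(["moby", "dick", "melville", "harper"]),
    "scarlet-letter": term_frequencies(["scarlet", "letter", "hawthorne", "ticknor"]),
}
match_key, match_score = best_match(review, candidates)
```

Because the functions operate only on term counts, the same sketch would apply whether the features come from a naive BOW table or from a filtered table produced by rule-based extraction or NER.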
Files
Name | Size | Checksum |
---|---|---|
MBP-data.zip | 28.9 MB | md5:91da89453d2dab77ec88927a6c0f19f6 |
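The MD5 checksum listed for the archive can be used to confirm that a download is intact. The sketch below is one way to do that with Python's standard library; the local path `MBP-data.zip` and the helper names (`md5_of_file`, `matches_listed_checksum`) are assumptions, so adjust them to wherever the file was actually saved.

```python
import hashlib

# Checksum as listed for MBP-data.zip on this record.
EXPECTED_MD5 = "91da89453d2dab77ec88927a6c0f19f6"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_listed_checksum("MBP-data.zip")` should return `True` for an uncorrupted download saved under that (assumed) name.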
Additional details
Dates
- Created: 2019-08-01/2023-11-01