23 January 2012 A synthetic document image dataset for developing and evaluating historical document processing methods
Daniel Walker, William Lund, Eric Ringger
Proceedings Volume 8297, Document Recognition and Retrieval XIX; 829710 (2012) https://doi.org/10.1117/12.912203
Event: IS&T/SPIE Electronic Imaging, 2012, Burlingame, California, United States
Abstract
Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiqués. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.
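The abstract reports its results in terms of word error rate (WER). As a rough illustration only, and not the authors' evaluation code, WER is commonly computed as the word-level edit distance between an OCR hypothesis and the ground truth transcription, divided by the number of words in the ground truth. The function name, the whitespace tokenization, and the sample strings below are assumptions made for this sketch.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace; a real
    evaluation may normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox"   # hypothetical ground truth
    ocr = "the quick brovvn fox"    # hypothetical noisy OCR output
    print(word_error_rate(truth, ocr))
```

Under this definition a lower value is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.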
© (2012) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Daniel Walker, William Lund, and Eric Ringger "A synthetic document image dataset for developing and evaluating historical document processing methods", Proc. SPIE 8297, Document Recognition and Retrieval XIX, 829710 (23 January 2012); https://doi.org/10.1117/12.912203
KEYWORDS
Optical character recognition
Data modeling
Image processing
Error analysis
Calibration
Associative arrays
Binary data
