A synthetic document image dataset for developing and evaluating historical document processing methods

Daniel Walker; William Lund; Eric Ringger

doi:10.1117/12.912203

23 January 2012

Proceedings Volume 8297, Document Recognition and Retrieval XIX; 829710 (2012) https://doi.org/10.1117/12.912203
Event: IS&T/SPIE Electronic Imaging, 2012, Burlingame, California, United States

Abstract

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset, the Eisenhower Communiqués. The new datasets also benefit from additional metadata that exist due to the nature of their collection and prior labeling efforts. We demonstrate the usefulness of the synthetic datasets by training an existing multi-engine OCR correction method on the synthetic data and then applying the model to reduce word error rates on the historical document dataset. The synthetic datasets will be made available for use by other researchers.
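
The abstract reports results in terms of word error rate but does not define the metric. As a minimal sketch, assuming the common definition (word-level edit distance between the OCR output and the ground truth transcription, divided by the number of words in the ground truth), it could be computed as follows. The function name `word_error_rate` and the example strings are hypothetical and are not taken from the paper; the authors' exact scoring procedure may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for each
    substitution, insertion, and deletion. The paper's own scoring
    may tokenize or weight errors differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Illustrative strings only (not drawn from the datasets):
truth = "the quick brown fox"
ocr_output = "the quiek brown fox"
wer = word_error_rate(truth, ocr_output)  # one substitution over four words
```

Under this definition a score of 0 means the OCR output matches the transcription word for word, and the score can exceed 1 when the OCR output contains many spurious words.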

Citation

Daniel Walker, William Lund, and Eric Ringger "A synthetic document image dataset for developing and evaluating historical document processing methods", Proc. SPIE 8297, Document Recognition and Retrieval XIX, 829710 (23 January 2012); https://doi.org/10.1117/12.912203
