MiBio: A dataset for OCR post-processing evaluation

We introduce a dataset for OCR post-processing model evaluation. This dataset contains fully aligned OCR texts and the ground truth recognition texts of an English biodiversity book. To make it better suited for benchmark evaluation, we extracted the following information into TSV files: 1) 2907 OCR-generated errors with their positions in the OCR texts and their corrections in the ground truth text, and 2) the ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide a quantitative analysis of the data.


© 2018 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Subject area: Computer Science
More specific subject area: Natural Language Processing
Type of data: Text, Table

We provide the ground truth word and sentence segmentation for the OCR texts to disambiguate word and sentence boundaries and to serve as a reference when evaluating the tokenization performance of post-processing models.

Data
We make available the Mining Biodiversity (MiBio) dataset with 2910 OCR-generated errors, along with the OCR and ground truth recognition texts, for benchmark testing. The OCR text was generated from the book "Birds of Great Britain and Ireland (Volume II)" [1] and made publicly available by the Biodiversity Heritage Library (BHL) for Europe using Tesseract 3.0.23. The ground truth text is based on an improved OCR output and was adjusted manually to match the original content of the whole book.
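Since the error list is distributed as a TSV file, it can be loaded with the standard library. The following is a minimal sketch; the column names (`position`, `error`, `correction`) and the sample row are assumptions for illustration, not the released schema:

```python
import csv
import io

# Hypothetical TSV layout: one error per row with its character offset in
# the OCR text, the erroneous string, and its ground-truth correction.
# The actual column names in the released files may differ.
sample = "position\terror\tcorrection\n1864\tcountr}^\tcountry\n"

with io.StringIO(sample) as f:
    reader = csv.DictReader(f, delimiter="\t")
    errors = [
        (int(row["position"]), row["error"], row["correction"])
        for row in reader
    ]

print(errors)  # [(1864, 'countr}^', 'country')]
```

For the real files, replace the `io.StringIO` wrapper with `open(path, newline="")`.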
The scanned image data of the book comprises 460 page-separated files, of which 211 pages contain the main content. The scanned images and the raw OCR outputs in several formats are accessible online at https://archive.org/download/birdsofgreatbrit02butl.

OCR and ground truth recognition texts preprocessing
The dataset is generated from two OCR outputs for the book "Birds of Great Britain and Ireland (Volume II)" [1]. One version is generated by the standard BHL-Europe recognition workflow, whose OCR engine is Tesseract 3.0.23. We manually corrected the OCR errors in the OCR outputs to obtain the ground truth. We then removed footnotes and page numbers from both versions to keep the content fluent across pages.

OCR error extraction
When generating the error list, we adopted the following rule for extracting OCR errors from the aligned contents of the OCR and ground truth texts: when segmenting an OCR-generated string into substrings that match tokens in the ground truth text, the separating positions are approximated manually to make the best guess. For example, given the OCR string "fFrin^HluurJ" aligned with "(Fringillinae)" in the ground truth, we separated this string into three error-correction mappings: ⟨f → (⟩, ⟨Frin^HluurJ → Fringillinae⟩, and ⟨J → )⟩. In another example, given the OCR string "countr}^" and "country," in the ground truth, we split it into two error-correction mappings: ⟨countr}^ → country⟩ and ⟨ → ,⟩.
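An automatic character-level alignment can produce similar error-correction pairs, though it may segment differently from the manual best-guess splits described above. A minimal sketch using `difflib.SequenceMatcher` (not the authors' tool):

```python
import difflib

def extract_error_mappings(ocr: str, truth: str):
    """Align an OCR string with its ground-truth counterpart and emit
    <error -> correction> substring mappings for every non-matching
    region. A sketch of the extraction idea, not the dataset's exact
    manual segmentation."""
    matcher = difflib.SequenceMatcher(None, ocr, truth, autojunk=False)
    mappings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace / delete / insert all yield a mapping
            mappings.append((ocr[i1:i2], truth[j1:j2]))
    return mappings

print(extract_error_mappings("countr}^", "country,"))  # [('}^', 'y,')]
```

Note that the automatic alignment merges the trailing comma into one mapping, whereas the manual rule above splits it into two; such differences are why the dataset's segmentation positions were approximated by hand.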
Two ASCII substitutions of Unicode characters are allowed: (æ, ae) and (Æ, AE). Note that the dataset is generated from a biodiversity book, which contains terminologies with non-ASCII characters, for example, Corvidæ or ORIOLIDÆ. We accept these two ASCII substitutions in order to match the original terminologies to their ASCII counterparts.
Two aligned words that differ only in case are not treated as an error. We observed that the standard BHL-Europe recognition workflow tends to lowercase the non-initial characters of some fully capitalized words. Thus, we do not categorize this type of mismatch as an error. Such a change in capitalization is also hard for human readers to detect from the input text alone once the page layout is removed.
Extra whitespace between tokens is allowed. We also observed that the standard BHL-Europe recognition workflow generates extra whitespace between tokens. We do not categorize this type of mismatch as an error unless the inserted whitespace leads to a splitting or merging error.
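The three tolerance rules above (the two permitted æ/Æ substitutions, case-only differences, and extra whitespace) can be folded into a single comparison predicate. A sketch, with a hypothetical function name:

```python
def is_tolerated_mismatch(ocr_span: str, truth_span: str) -> bool:
    """Return True when an OCR/ground-truth mismatch falls under the
    tolerance rules and is therefore NOT counted as an error: the two
    permitted ASCII substitutions, case-only differences, and extra
    whitespace between tokens. A sketch of the stated rules."""
    def normalize(s: str) -> str:
        s = s.replace("\u00e6", "ae").replace("\u00c6", "AE")  # ae / AE
        s = " ".join(s.split())  # collapse extra whitespace
        return s.lower()         # ignore capitalization changes
    return normalize(ocr_span) == normalize(truth_span)

print(is_tolerated_mismatch("ORIOLIDAE", "Oriolid\u00e6"))  # True
print(is_tolerated_mismatch("a  b", "a b"))                 # True
print(is_tolerated_mismatch("countr}^", "country"))         # False
```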

OCR text tokenization
Tokenizing OCR text is an internal step in OCR post-processing, and tokenization performance affects downstream error detection and correction. Since intra-word characters of OCR errors can be misrecognized as punctuation, it is hard to distinguish misrecognized punctuation from true punctuation in an OCR text, which leads to high token-boundary ambiguity. We therefore provide the ground truth OCR tokens for evaluating the tokenization performance of OCR post-processing models. The ground truth tokens are generated by first tokenizing the ground truth recognition text and then mapping the segmentation positions onto the OCR texts.
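The projection step (tokenize the ground truth, then map token boundaries onto the OCR text) could be sketched with a character-level alignment. This is an illustrative implementation under that assumption, not the authors' pipeline:

```python
import difflib

def project_token_boundaries(truth_tokens, truth_text, ocr_text):
    """Project ground-truth token start offsets onto the OCR text via a
    character-level alignment, yielding ground-truth OCR token starts.
    A sketch of the projection step described above."""
    matcher = difflib.SequenceMatcher(None, truth_text, ocr_text,
                                      autojunk=False)
    # Locate each token's start offset in the ground truth text.
    starts, pos = [], 0
    for tok in truth_tokens:
        pos = truth_text.index(tok, pos)
        starts.append(pos)
        pos += len(tok)
    # Map each ground-truth offset to an OCR offset via the opcodes.
    ocr_starts = []
    for s in starts:
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "insert" and i1 <= s < i2:
                # Clamp inside the OCR-side block (deletes have j1 == j2).
                ocr_starts.append(min(j1 + (s - i1), max(j2 - 1, j1)))
                break
    return ocr_starts

truth = "country, however"
ocr = "countr}^ however"
print(project_token_boundaries(["country", ",", "however"], truth, ocr))
# [0, 7, 9]
```

Here the comma token, which the OCR text lacks, is projected onto the nearest surviving OCR character, mirroring the best-guess positions discussed for error extraction.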
Using the ground truth OCR tokens in the dataset as a reference, we quantitatively analyze the tokenization performance of different schemes on the OCR texts, including whitespace tokenization, the Penn Treebank tokenizer, WASTE [2], and Elephant [4]. The results are shown in Table 2. They indicate that the correct word boundaries of OCR errors are hard to identify with hand-crafted rules or trained segmentation models.
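One common way to score a tokenizer against ground truth tokens is to compare the sets of (start, end) character spans over the whitespace-stripped text. The sketch below illustrates that scheme; the function name and the toy tokens are assumptions, and the actual Table 2 metrics may be defined differently:

```python
def boundary_prf(pred_tokens, gold_tokens):
    """Score predicted tokens against gold tokens by comparing the sets
    of (start, end) character spans over the concatenated text.
    Returns (precision, recall, F1). An illustrative metric sketch."""
    def spans(tokens):
        out, pos = set(), 0
        for tok in tokens:
            out.add((pos, pos + len(tok)))
            pos += len(tok)
        return out
    pred, gold = spans(pred_tokens), spans(gold_tokens)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# A tokenizer that fails to split a trailing comma gets partial credit:
print(boundary_prf(["country,", "sir"], ["country", ",", "sir"]))
# precision 0.5, recall 1/3, F1 0.4
```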

Dataset analysis
To take a closer look at the OCR input/output, we show a sample segment of OCR-generated text with its original scanned image in Fig. 1. Table 1 reports the OCR performance, measured by precision and recall, indicating a high-quality OCR output with low error rates in both word- and character-level measurements.
Observing that some OCR errors are orthographically far from their corrections, we further analyze the distribution of error words with respect to Levenshtein edit distance [3] in Table 3. Although errors within edit distance three account for more than 80% of the OCR errors, some OCR errors have a high edit distance and are very complicated to correct.

Table 1. The precision and recall of the OCR-generated text.
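Edit distances such as those binned in Table 3 can be computed with the standard dynamic-programming recurrence. A self-contained sketch, reusing the error-correction pairs from the examples above:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein edit distance [3]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Distance distribution over the error-correction pairs from the
# "(Fringillinae)" and "country," examples above.
pairs = [("countr}^", "country"), ("f", "("), ("J", ")")]
dist = Counter(levenshtein(err, corr) for err, corr in pairs)
print(sorted(dist.items()))  # [(1, 2), (2, 1)]
```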
