1 Overview

Context

Commentaries are prominent reference works in classical scholarship. Providing a text with explanatory glosses of various kinds (historical, linguistic, mythological, etc.), they often contain highly domain-specific named entities (NEs) and bibliographic references. This specificity, along with the flaws of Optical Character Recognition (OCR) transcriptions and the lack of annotated data, makes this type of document extremely challenging for Natural Language Processing (NLP) tasks such as information extraction and semantic indexing.

To address this issue, we created a multilingual, named entity-annotated corpus of 19th-century commentaries on Sophocles’ Ajax. The corpus was produced in the context of the Ajax Multi-Commentary (AjMC) project in order to support the automatic enrichment of digitised commentaries. An earlier version of this corpus (v. 0.3) was published and used in the context of the 2022 edition of the shared task HIPE – Identifying Historical People, Places and other Entities (); the version of the corpus described in this paper (v. 0.4) improves data quality and includes an additional annotation layer of bibliographic references.

2 Method

To support the creation of this corpus we defined a set of guidelines for the annotation of domain-specific entities (). These guidelines propose a unified approach to the annotation of both traditional and bibliographic entities in Classics publications. The corpus contains two layers of annotations (see Figure 1). The first layer captures information about bibliographic references to primary and secondary sources, following the taxonomy defined by Colavizza and Romanello ().

Figure 1 

An example of an annotated sentence.

The second layer contains both universal NEs (PERSON, LOCATION, ORGANIZATION, DATE) and more domain-specific NEs (WORK, SCOPE, OBJECT). This coarse-grained tagset is complemented by a fine-grained tagset defining sub-types for certain entity types, enabling, for example, the distinction between a person being an author (PERSON.AUTHOR) and a person being a mythological character (PERSON.MYTH). Entities are linked to Wikidata (or to a NIL entity if no entry is available), except when they are nested or contained within a secondary bibliographic reference. For entities containing OCR mistakes, a manually corrected transcription is provided, which makes it possible to classify mentions into OCR quality bands or to compute the percentage of noisy mentions in the corpus (cf. Table 2).
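As an illustration of how the corrected transcriptions can be exploited, the following sketch bands a mention by OCR quality and derives the share of noisy mentions. The helper names and the similarity thresholds are illustrative assumptions, not part of the corpus schema.

```python
# Minimal sketch (assumed helper names and thresholds): band an entity mention
# by OCR quality using its corrected transcription, and compute the share of
# noisy mentions in a collection of (ocr, corrected) pairs.
from difflib import SequenceMatcher

def ocr_quality_band(ocr_surface: str, corrected: str) -> str:
    """Assign a mention to a coarse OCR quality band via character similarity."""
    similarity = SequenceMatcher(None, ocr_surface, corrected).ratio()
    if similarity == 1.0:
        return "clean"           # OCR output identical to the corrected form
    if similarity >= 0.75:       # threshold chosen for illustration only
        return "slightly noisy"
    return "noisy"

def noisy_share(mentions: list[tuple[str, str]]) -> float:
    """Percentage of mentions whose OCR surface form differs from the corrected one."""
    noisy = sum(1 for ocr, gold in mentions if ocr != gold)
    return 100 * noisy / len(mentions)

print(ocr_quality_band("Sophocies", "Sophocles"))                    # -> "slightly noisy"
print(noisy_share([("Sophocies", "Sophocles"), ("Aiax", "Aiax")]))   # -> 50.0
```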

Annotation procedure and sampling strategy

The pages to annotate were sampled from the introductions and glosses (i.e. the main commentary sections) of five 19th-century commentaries on the Ajax available in the public domain. The commentaries by Schneidewin (1853), Tournier (1866), Campbell (1881), Wecklein (1894) and Jebb (1896) were all chosen because of their importance and their availability. An earlier Latin commentary by Lobeck (1835) was excluded from sampling, as Latin is an otherwise under-represented language in the AjMC commentary corpus. OCR was performed with Tesseract, and annotation was done by two independent annotators using INCEpTION (). Pages that were considered too short (fewer than 100 tokens) or problematic (e.g. errors in the page reading order, missing text due to erroneous layout recognition, etc.) were discarded by the annotators. In order to ensure consistently annotated data, ambiguities arising during annotation were shared and discussed with two curators. Before starting the annotation campaign, annotators and curators annotated a small reference corpus (approx. 3,900 tokens), both to familiarise themselves with the annotation guidelines and to test their robustness and clarity. Finally, all produced annotations were reviewed by the two curators in order to ensure consistency across annotators and to correct annotation mistakes.
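For readers wishing to reproduce a similar pre-filtering step, the sketch below shows one possible automation of the 100-token threshold. The pytesseract wrapper and the language codes are assumptions; the project's actual OCR models are not specified here, and the judgement of problematic pages (reading order, layout) was made manually by the annotators.

```python
# Rough sketch (assumptions: pytesseract wrapper, eng+grc Tesseract models).
# OCR a page image and keep it only if it yields at least 100 tokens;
# "problematic" pages were in practice discarded by the annotators, not by code.
import pytesseract
from PIL import Image

MIN_TOKENS = 100

def page_is_annotatable(image_path: str) -> bool:
    text = pytesseract.image_to_string(Image.open(image_path), lang="eng+grc")
    return len(text.split()) >= MIN_TOKENS
```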

Quality control

We performed double annotation on a sub-set of approximately 2,000 tokens per language (22 commentary pages and 6,400 tokens in total) in order to calculate inter-annotator agreement (IAA) rates using Krippendorff’s α (see Table 1). Overall, both named entity recognition (NER) and entity linking (EL) show good agreement between annotators, with average IAA rates of 0.81 and 0.88 respectively. Typical sources of disagreement include the erroneous inclusion of end-of-sentence punctuation in the entity mention and, for EL, the presence of homonymous entities in Wikidata, which leads to the wrong entity being selected (e.g. Sophocles the Classical playwright vs. the Hellenistic tragic poet of the same name).

Table 1

IAA rates computed on a double-annotated sample of the corpus.


SUB-CORPUS   NER    EL

English      0.83   0.95

French       0.74   0.87

German       0.85   0.81

Avg.         0.81   0.88
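As a rough indication of how such figures can be obtained, the sketch below computes Krippendorff’s α over token-level labels with NLTK’s AnnotationTask. Treating NER as token classification is a simplification made for this example only; the paper does not specify the exact unitisation used, and the triples are toy data, not corpus content.

```python
# Minimal sketch: Krippendorff's alpha over token-level NE labels for two
# annotators, using NLTK. Toy (coder, item, label) triples for illustration.
from nltk.metrics.agreement import AnnotationTask

data = [
    ("ann1", "tok1", "B-pers"), ("ann2", "tok1", "B-pers"),
    ("ann1", "tok2", "O"),      ("ann2", "tok2", "O"),
    ("ann1", "tok3", "B-work"), ("ann2", "tok3", "O"),
]
task = AnnotationTask(data=data)
print(round(task.alpha(), 2))
```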

3 Dataset Description

Object name

AjMC-NE-corpus.

Format names and versions

CoNLL-like HIPE TSV format.

Creation dates

November 2021 to March 2022.

Dataset creators

Carla Amaya (UNIL, annotation), Kevin Duc (UNIL, annotation), Sven Najem-Meyer (EPFL, data curation), Matteo Romanello (UNIL, data curation & supervision).

Language

Primarily French, German and English, and to a lesser extent Ancient Greek and Latin.

License

Creative Commons Attribution 4.0 International (CC BY 4.0).

Repository name

GitHub and Zenodo.

Publication date

The current version of the corpus (v. 0.4) was published on 2023-09-01; version 0.3 was published on 2022-05-20 as part of the HIPE-2022 data (v. 2.1).

Corpus statistics

With about 300 annotated pages, 111,218 tokens and 7,334 entity mentions (see Table 2), this corpus is relatively small compared to other corpora (). Nested entities constitute a marginal phenomenon, more frequent in German and French than in English. OCR noise affects on average 18.66% of entity mentions, but the French commentary has a rate of noisy entities almost three times higher than the English and German ones, indicating a much lower OCR quality. As for the distribution of fine-grained mentions (see Table 3), certain entity types, namely OBJECT, LOC and DATE, are heavily under-represented. Wikidata provides excellent coverage for EL in this corpus, as 98.55% of mentions have a corresponding entry in the knowledge base.

Table 2

Basic statistics for the AjMC NE corpus (version 0.4).


LANG.   FOLD    DOCS   TOKENS    MENTIONS
                                 ALL     FINE    NESTED   %NOISY   %NIL

de      Train   76     22,695    1,738   1,738   11       13.81    0.92
        Dev     14     4,701     403     408     2        11.41    0.49
        Test    16     4,845     382     382     0        10.99    1.83
        Total   106    32,241    2,528   2,523   13       13.00    0.99

en      Train   60     30,932    1,823   1,823   4        10.97    1.65
        Dev     14     6,506     416     416     0        16.83    1.68
        Test    13     6,052     348     348     0        10.34    2.59
        Total   87     43,490    2,587   2,587   4        11.83    1.78

fr      Train   72     24,669    1,621   1,621   9        30.72    0.99
        Dev     17     5,425     391     391     0        36.32    2.56
        Test    15     5,391     360     207     0        27.50    1.45
        Total   104    35,487    2,219   2,372   9        31.16    1.31

Grand Total     297    111,216   7,334   7,334   26       18.66    1.36
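The per-fold figures above can in principle be recomputed from the released TSV files. The sketch below counts mentions and NIL links assuming the HIPE-2022 column layout (TOKEN, NE-COARSE-LIT, ..., NEL-LIT, MISC); the file name is illustrative, and the exact comment-line and NIL conventions should be checked against the released data.

```python
# Sketch: recompute mention and NIL counts from a HIPE-style TSV file.
# Assumptions: '#' comment lines, a TOKEN / NE-COARSE-LIT / NEL-LIT header as
# in the HIPE-2022 format, and the string 'NIL' marking unlinkable mentions.
import csv

def mention_stats(tsv_path: str) -> tuple[int, float]:
    mentions = nil = 0
    with open(tsv_path, encoding="utf-8") as f:
        rows = csv.DictReader(
            (line for line in f if not line.startswith("#")),
            delimiter="\t", quoting=csv.QUOTE_NONE,
        )
        for row in rows:
            if row["NE-COARSE-LIT"].startswith("B-"):  # start of a mention
                mentions += 1
                if row["NEL-LIT"] == "NIL":
                    nil += 1
    pct_nil = 100 * nil / mentions if mentions else 0.0
    return mentions, pct_nil

# Illustrative path; actual file names follow the HIPE-2022 release.
print(mention_stats("HIPE-2022-v2.1-ajmc-dev-en.tsv"))
```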

Table 3

The tagset of annotated entities in the AjMC NE corpus (version 0.4); for each entity type, the total number of mentions in the corpus and some selected examples are provided.


COARSE TAG SET   FINE TAG SET     NB. MENT.   LINKING   EXAMPLES

PERS             PERS.AUTHOR      1,212       yes       “Sophocles”, “Euripid.”
                 PERS.EDITOR      153                   “Triclinius”, “Schndw.”
                 PERS.MYTH        933                   “Tekmessa”, “Ajax”
                 PERS.OTHER       237                   “Musgrave”, “Perikles”
                 Total (PERS)     2,535

WORK             WORK.PRIMLIT     1,566       yes       “O. T.”, “Iliad”
                 WORK.SECLIT      82                    “Lexicon Sophocleum”, “L. and S.”
                 WORK.FRAGM       12                    “Frag. adesp.”, “fragm.”
                 WORK.JOURNAL     1                     “the Cambridge Journal of Philology”
                 WORK.OTHER       3
                 Total (WORK)     1,664

OBJECT           OBJECT.MANUSCR   25          no        “Laurentianus A”

LOC              -                109         yes       “Athènes”, “Salamisinsel”

DATE             -                26          no        “um 770 v. Chr.”, “A.D. 1618”

SCOPE            -                2,975       no        “1340 f.”, “1083”

Usage of the Corpus in the HIPE-2022 Shared Task

The dataset was featured in two of the challenges into which the HIPE-2022 shared task () was organised: the Multilingual Classical Commentary Challenge and the Global Adaptation Challenge. While the former focused on commentaries, the latter aimed at testing the ability of participating systems to handle both commentaries and newspapers written in at least two languages.

NER results were reassuring. Despite considerable OCR noise, the best systems reached an overall F1-score of 85.4% for English, 84.2% for French and 93.4% for German on the coarse tagset.

Much work remains to be done, however, in the EL task; here, the best F1-score for the linking of pre-extracted mentions is 38.1% for English, 47% for French and 50.3% for German.

Two other factors combine with OCR noise to make EL on this corpus an extremely challenging task. First, the style of commentary writing favours conciseness and makes abundant use of abbreviations (approx. 47% of all entity mentions are abbreviated). Second, abbreviations rely heavily on context: in a commentary on a tragedy by Sophocles, the commentator will refer to the tragedy Philoctetes simply as Ph. (instead of Philoct.), making such abbreviations hard to resolve for EL systems.
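To make the role of context concrete, the sketch below resolves an abbreviated title by first checking works of the author being commented upon. The candidate dictionary, the placeholder identifiers and the prefix-matching rule are purely illustrative assumptions and do not describe any HIPE-2022 system.

```python
# Toy heuristic only: placeholder identifiers, not real Wikidata QIDs.
WORKS_BY_AUTHOR = {
    "Sophocles": {"Philoctetes": "<QID-Philoctetes>", "Oedipus Tyrannus": "<QID-OT>"},
    "Euripides": {"Phoenissae": "<QID-Phoenissae>"},
}

def resolve_abbreviation(abbrev: str, commented_author: str) -> str | None:
    stem = abbrev.rstrip(".").lower()
    # 1) Prefer works by the author under commentary ("Ph." -> Philoctetes in a
    #    commentary on Sophocles, despite e.g. Euripides' Phoenissae also matching).
    for title, qid in WORKS_BY_AUTHOR.get(commented_author, {}).items():
        if title.lower().startswith(stem):
            return qid
    # 2) Fall back to works by any author.
    for works in WORKS_BY_AUTHOR.values():
        for title, qid in works.items():
            if title.lower().startswith(stem):
                return qid
    return None

print(resolve_abbreviation("Ph.", "Sophocles"))  # -> "<QID-Philoctetes>"
```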

4 Reuse Potential

While this is, to the best of our knowledge, the first named entity-annotated corpus of classical commentaries, it has certain limitations. Firstly, given the research context it originates from, the selection of commentaries is limited to Attic tragedy; to make the corpus more generic, commentaries on works of Ancient Greek prose, as well as on works of Latin prose and poetry, should also be considered. Secondly, some entity types are heavily under-represented (see Table 3), a limitation which could be mitigated by applying data augmentation or meta-learning approaches during training.

The main anticipated use of this corpus is to train and evaluate domain-specific models for NER and EL on historical documents. In fact, the sampling strategy adopted, both in the selection of commentaries and in the choice of page sections to annotate (introductions and glosses), has led to a dataset that is neither complete nor representative enough to study higher-level characteristics of the commentary genre (e.g. by means of textual analysis).

Instead, due to the very nature of the data, this corpus is particularly suitable for testing the adaptability of NER systems to noisy, multilingual and multiscript texts. The density of abbreviated entity mentions also makes it an excellent testbed for evaluating the ability of EL systems to deal with domain-specific, and oftentimes cryptic, abbreviations. Furthermore, the bibliographic reference annotation layer contains a particularly rich set of primary source references. As recent initiatives demonstrate, bibliographic reference extraction in the humanities is far from a solved problem, and the scarcity of available data for this task is a real issue (). Hopefully, this dataset will help alleviate this issue in the domain of Classics.

Finally, based on the encouraging results obtained on noisy and highly abbreviated texts such as commentaries, it is reasonable to expect that NER models trained on this corpus will perform fairly well on commentaries, books and journal articles which are either born digital or of better OCR quality.