Toward a format-neutral annotation store
Introduction
Linguists and others who study language have long collected examples of actual language usage, both written and, increasingly, spoken. In the course of their investigations they have found it useful to annotate these examples with features relevant to their particular research questions. Often these annotations are devised and applied by an individual researcher and are used for only a single research project. But there is an increasing desire to share and re-use not only examples of language, but also the annotations that accompany them.
Speech corpora of increasing diversity and size are collected in many different domains of research, and are explored and processed by an overwhelming variety of software tools that aid manual and automatic analysis of speech. Examples include, in the domain of child speech research, the Child Language Data Exchange System (CHILDES) project (MacWhinney, 1984), annotated with CLAN (Spektor and Chen, 2012); in the corpus linguistics domain, the British National Corpus (BNC) (BNC Consortium, 2007), transcribed using a version of the Text Encoding Initiative (TEI) guidelines (Burnard and Bauman, 2012); in the domain of phonetics, the Buckeye corpus (Pitt et al., 2007), released in the XWaves (Hawkings, 2008) format; and in the discourse analysis domain, the AMI Corpus (Carletta et al., 2006), released in the NXT format (Kilgour and Carletta, 2006), to name but a few. Recording, transcribing, and annotating speech is an expensive and time-consuming process, and so sharing these resources is desirable.
Brian MacWhinney, one of the driving forces behind the CHILDES project, has explained that the sharing of such language data is important to facilitate further study from the same raw data, and also to promote academic rigour, as findings can be checked against the data from which they are derived (MacWhinney, 2012, Section 3.1 pp. 14–15).
In addition to promoting academic rigour, a great deal of research can be facilitated by the re-use of existing recordings, transcriptions, and annotations. This applies not only to ‘open’ corpora that are available to researchers in different institutions and for different purposes, but also to ‘closed’ corpora that, for participant consent or other reasons, cannot be shared outside their originating institution; new research projects can build on the work of previous projects, allowing corpora to accumulate annotations that are increasingly diverse or refined.
There are many challenges to re-using and sharing linguistic annotation data, and different approaches have been used in order to address these challenges. This paper discusses some of these challenges and solutions.
The structure of this article is as follows. Section 2 discusses in some detail the barriers to the sharing and re-use of linguistic annotations, and some solutions to the problems raised, including the use of a ‘pivot’ annotation model. Section 3 describes various pivot models that have been proposed in the literature, including “Annotation Graphs”, an extension of which we have found useful in the development of a corpus annotation system called “LaBB-CAT”. This software is then described in Section 4, which explains our pivot model extensions, how they solve some outstanding problems, and how they provide further benefits for annotation processing. Finally, the discussion is summarised and future work is proposed in Section 5.
Sharing, converting, and re-using
When linguistic annotations are created the focus is often on the particular needs of a specific research project. However, facilitating the sharing of annotation data is an increasingly important goal of annotation. New research projects may involve comparisons with past projects’ data, building on the annotation work previously carried out, or merging different corpora into larger collections. Annotation sharing also encompasses the translation of data between two inter-operating systems.
The pivot form
This section describes various possible candidates for the ‘pivot’ form for central storage of annotations.
The goal is to be able to represent any kind of language annotation data, whether it be from the domain of phonetics, sociolinguistics, discourse analysis, syntax, child language development, Natural Language Processing (NLP) or any other domain where language may be annotated and those annotations shared. Ide and Romary proposed a set of requirements for such a system (Ide and Romary, 2004,
Developing a format neutral annotation store
In the development of LaBB-CAT, we have found that Annotation Graphs have worked very well to represent a diversity of annotation scenarios. We have previously described the benefits of using this model in Fromont and Hay (2012). However, as discussed in Section 3.2.1, there are problems relating to the lack of specification about how annotations and annotation types relate to each other; this is why Annotation Graphs receive a × for Processability in Table 1.
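To make the relationship problem concrete, the following is a minimal sketch of a plain annotation graph, assuming the usual description of the model as labelled arcs between anchor nodes that may or may not carry a time offset. The class and method names (Anchor, Annotation, AnnotationGraph, coextensive) are hypothetical and chosen for illustration; this is not LaBB-CAT's implementation or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Anchor:
    """A node in the graph; offset is seconds into the recording, or None if unaligned."""
    id: str
    offset: Optional[float] = None


@dataclass(frozen=True)
class Annotation:
    """A labelled arc between two anchors, on a named layer (annotation type)."""
    layer: str
    label: str
    start: Anchor
    end: Anchor


@dataclass
class AnnotationGraph:
    """Hypothetical container: anchors keyed by id, plus a flat list of arcs."""
    anchors: Dict[str, Anchor] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)

    def anchor(self, id: str, offset: Optional[float] = None) -> Anchor:
        a = Anchor(id, offset)
        self.anchors[id] = a
        return a

    def annotate(self, layer: str, label: str, start: Anchor, end: Anchor) -> Annotation:
        ann = Annotation(layer, label, start, end)
        self.annotations.append(ann)
        return ann

    def layer(self, name: str) -> List[Annotation]:
        """All annotations on one layer, in insertion order."""
        return [a for a in self.annotations if a.layer == name]

    def coextensive(self, annotation: Annotation, layer: str) -> List[Annotation]:
        """Annotations on `layer` that share both anchors with `annotation`.

        In a plain graph this is the only way to infer that, say, a
        part-of-speech tag belongs to a particular word: nothing in the
        model itself states how the two layers relate.
        """
        return [
            a for a in self.layer(layer)
            if a.start == annotation.start and a.end == annotation.end
        ]


# Illustrative data (invented for this sketch): two words, their tags, one utterance.
g = AnnotationGraph()
a0, a1, a2 = g.anchor("a0", 0.0), g.anchor("a1", 0.35), g.anchor("a2", 0.80)
the = g.annotate("word", "the", a0, a1)
cat = g.annotate("word", "cat", a1, a2)
g.annotate("pos", "DET", a0, a1)
g.annotate("pos", "NOUN", a1, a2)
g.annotate("utterance", "the cat", a0, a2)

pos_of_cat = [a.label for a in g.coextensive(cat, "pos")]
```

In this sketch the link between a word and its tag has to be recovered by comparing anchors, because the graph records no explicit relationship between the "word" and "pos" layers. That is one way of reading the under-specification noted above.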
The following section briefly
Conclusions and future work
Many of the corpus-sharing challenges posed by the diversity of annotation software and formats can be met by using a ‘pivot’ data model for annotation storage. Two-way conversions to tool-specific formats enable round-tripping of data through different tools without data loss, as annotations that are not representable within a given tool are nevertheless still represented in the central annotation store. Structure and alignment issues, ontological and granularity mis-matches, and convention
Acknowledgements
I would like to thank Florian Schiel and the two anonymous reviewers, who provided insightful comments and useful information. I would also like to thank Professor Jennifer Hay for her invaluable feedback while drafting this article, and her ongoing support and collaboration with the development of LaBB-CAT.
I would also like to acknowledge the LaBB-CAT user community for their ongoing feedback, collaboration and support.
References (65)
- et al. Praat: Doing Phonetics by Computer [Computer Program] (2016)
- et al. Discourse Transcription (1992)
- et al. Transana [Computer program] (2012)
- et al. English Linguistic Guide for The CELEX Lexical Database (1995)
- Barras, C., Boudahmane, K., Manta, M., Antoine, F., Galliano, S., 2008. Transcriber [Computer program]....
- TranscriberAG [Computer program] (2011)
- NLTK: The Natural Language Toolkit. Proceedings of the COLING/ACL on Interactive Presentation Sessions (2006)
- et al. Towards A Query Language for Annotation Graphs. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2002) (2002)
- et al. ATLAS: A flexible and extensible architecture for linguistic annotation. Proceedings of the Second International Conference on Language Resources and Evaluation (2000)
- et al. A formal framework for linguistic annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania (1999)
- A formal framework for linguistic annotation (revised version). Speech Communication - Special issue on speech annotation and corpus tools archive
- Poio API and GraF-XML: a radical stand-off approach in language documentation and language typology. Proceedings of the 2013 Balisage: The Markup Conference
- British National Corpus (BNC) XML Edition
- The EMU Speech Database System [Computer program]
- TEI P5: Guidelines for Electronic Text Encoding and Interchange
- The AMI meeting corpus: a pre-announcement. Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction
- The alveo virtual laboratory: a web based repository API. Proceedings of the Ninth Language Resources and Evaluation Conference (LREC 2014)
- Multi-level annotation in the Emu speech database management system. Speech Commun.
- The IMS Corpus Workbench: Corpus Query Processor (CQP) User’s Manual
- “kia ora. this is my earthquake story”. multiple applications of a sociolinguistic corpus. Ampersand
- Report from TILR working group 1: tools interoperability and input/output formats. Proceedings of the 2007 Working Group Report from the “Toward the Interoperability of Language Resources” Workshop
- A framework for standardized syntactic annotation. Proceedings of the 2008 International Conference on Language Resources and Evaluation (LREC 2008)
- LAF-Fabric [Computer program]
- Speech processing tools – an introduction to interoperability. Proceeding of the 2011 INTERSPEECH
- Twenty-first century corpus workbench: updating a query architecture for the new millennium. Proceedings of the 2011 Corpus Linguistics Conference
- ONZE Miner: the development of a browser-based research tool. Corpora.
- LaBB-CAT: an annotation store. Proceedings of the 2012 Australasian Language Technology Association Workshop
- FoLiA: Format for Linguistic Annotation Document, version 1.1.1 Revision 4.5. Technical Report LST-14-01
- FoLiA: a practical XML format for linguistic annotation – a descriptive and comparative study. Comput. Linguist. Neth. J.
- EXMARaLDA [Computer program]
- Introduction to Xwaves+
- A corpus representation format for linguistic web services: the D-SPIN text corpus format and its relationship with ISO standards. Proceedings of the 2010 International Conference on Language Resources and Evaluation (LREC 2010)