Elsevier

Computer Speech & Language

Volume 45, September 2017, Pages 348-374
Toward a format-neutral annotation store

https://doi.org/10.1016/j.csl.2017.01.004

Highlights

  • Speech corpora are difficult to share and reuse because of format incompatibilities.

  • Various central ‘pivot’ models for annotations have been proposed as a solution.

  • LaBB-CAT uses Annotation Graphs with extensions to formalise annotation structure.

  • An API for the resulting model can be used for query, annotation and conversion.

Abstract

Sharing speech corpora and their annotations is desirable, in order to maximise the value gained from the expense and hard work involved in transcribing and annotating them. However, differences in conventions and format are barriers to the sharing of data: text conventions conflict, file formats differ, and annotation ontologies do not match up. Using a ‘pivot’ form to store annotations in a tool- and format-neutral manner can alleviate many of these difficulties. There are several possibilities for the pivot form, including the Annotation Graph model, which meets most of the requirements to be a pivot. The LaBB-CAT software’s implementation of Annotation Graphs incorporates some extensions to the model, which handle the remaining unmet requirements and make it possible to define an annotation API that eases the automation of conversion, querying, and manipulation of annotations.

Introduction

Linguists and others who study language have long collected examples of actual language usage, both written and, increasingly, spoken, and in the course of their investigations have found it useful to annotate their examples with features that are relevant to their particular research question. Often these annotations are devised and conducted by an individual researcher and are only used for a single research project. But there is an increasing desire to share and re-use not only examples of language, but also the annotations that accompany them.

Speech corpora of increasing diversity and size are collected in many different domains of research, and are explored and processed by an overwhelming variety of software tools that aid manual and automatic analysis of speech. To name but a few examples: in child speech research, the Child Language Data Exchange System (CHILDES) project (MacWhinney, 1984) is annotated with CLAN (Spektor and Chen, 2012); in corpus linguistics, the British National Corpus (BNC) (BNC Consortium, 2007) is transcribed using a version of the Text Encoding Initiative (TEI) guidelines (Burnard and Bauman, 2012); in phonetics, the Buckeye corpus (Pitt et al., 2007) is released in the XWaves (Hawkings, 2008) format; and in discourse analysis, the AMI Corpus (Carletta et al., 2006) is released in the NXT format (Kilgour and Carletta, 2006). Recording, transcribing, and annotating speech is an expensive and time-consuming process, and so sharing these resources is desirable.

Brian MacWhinney, one of the driving forces behind the CHILDES project, has explained that the sharing of such language data is important to facilitate further study from the same raw data, and also to promote academic rigour, as findings can be checked against the data from which they are derived (MacWhinney, 2012, Section 3.1 pp. 14–15).

In addition to promoting academic rigour, a great deal of research can be facilitated by the re-use of existing recordings, transcriptions, and annotations. This applies not only to ‘open’ corpora that are available to researchers in different institutions and for different purposes, but also to ‘closed’ corpora that, for participant consent or other reasons, cannot be shared outside their originating institution; new research projects can build on the work of previous projects, allowing corpora to accumulate annotations that are increasingly diverse or refined.

There are many challenges to re-using and sharing linguistic annotation data, and different approaches have been used in order to address these challenges. This paper discusses some of these challenges and solutions.

The structure of this article is as follows: Section 2 discusses in some detail the barriers to the sharing and re-use of linguistic annotations, and some solutions to the problems raised, including the use of a ‘pivot’ annotation model. Section 3 describes various pivot models that have been proposed in the literature, including “Annotation Graphs”, an extension of which we have found useful in the development of a corpus annotation system called “LaBB-CAT”. This software is then described in Section 4, which explains our pivot model extensions, how they solve some outstanding problems, and how they provide further benefits for annotation processing. Finally, the discussion is summarised and future work is proposed in Section 5.

Sharing, converting, and re-using

When linguistic annotations are created the focus is often on the particular needs of a specific research project. However, facilitating the sharing of annotation data is an increasingly important goal of annotation. New research projects may involve comparisons with past projects’ data, building on the annotation work previously carried out, or merging different corpora into larger collections. Annotation sharing also encompasses the translation of data between two inter-operating systems.

The pivot form

This section describes various possible candidates for the ‘pivot’ form for central storage of annotations.

The goal is to be able to represent any kind of language annotation data, whether it be from the domain of phonetics, sociolinguistics, discourse analysis, syntax, child language development, Natural Language Processing (NLP) or any other domain where language may be annotated and those annotations shared. Ide and Romary proposed a set of requirements for such a system (Ide and Romary, 2004,

Developing a format neutral annotation store

In the development of LaBB-CAT, we have found that Annotation Graphs have worked very well to represent a diversity of annotation scenarios. We have previously described the benefits of using this model in Fromont and Hay (2012). However, as discussed in Section 3.2.1, there are problems relating to the lack of specification about how annotations and annotation types relate to each other; this is why Annotation Graphs receive a × for Processability in Table 1.
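To make the underlying issue concrete, the following is a minimal illustrative sketch, not LaBB-CAT's actual data model or API. It assumes an annotation graph in which annotations are labelled edges between anchors, and adds one possible way of stating explicitly how annotation types relate: each layer names a parent layer, and each annotation names a parent annotation. All class, field, and layer names here (`Anchor`, `Layer`, `Annotation`, `Graph`, `turn`, `word`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Anchor:
    """A graph node; offset is seconds into the recording, or None if unknown."""
    id: str
    offset: Optional[float] = None


@dataclass(frozen=True)
class Layer:
    """A hypothetical annotation type that declares which layer its parents live on."""
    id: str
    parent_id: Optional[str] = None


@dataclass(frozen=True)
class Annotation:
    """A labelled edge between two anchors, belonging to exactly one layer."""
    id: str
    layer_id: str
    label: str
    start: Anchor
    end: Anchor
    parent_id: Optional[str] = None


@dataclass
class Graph:
    layers: Dict[str, Layer] = field(default_factory=dict)
    annotations: Dict[str, Annotation] = field(default_factory=dict)

    def add_layer(self, layer: Layer) -> None:
        self.layers[layer.id] = layer

    def add(self, annotation: Annotation) -> None:
        # An unknown layer raises KeyError; layers must be declared first.
        layer = self.layers[annotation.layer_id]
        if layer.parent_id is not None:
            # Enforce the declared hierarchy: the parent annotation must
            # exist and must sit on the layer's declared parent layer.
            parent = self.annotations.get(annotation.parent_id or "")
            if parent is None or parent.layer_id != layer.parent_id:
                raise ValueError(
                    f"{annotation.id} needs a parent on layer {layer.parent_id}"
                )
        self.annotations[annotation.id] = annotation

    def children(self, parent_id: str, layer_id: str) -> List[Annotation]:
        """Child annotations of a parent on one layer, ordered by start offset."""
        found = [
            a for a in self.annotations.values()
            if a.parent_id == parent_id and a.layer_id == layer_id
        ]
        # Anchors without an offset sort after those with one.
        return sorted(
            found, key=lambda a: (a.start.offset is None, a.start.offset or 0.0)
        )


# Example: one speaker turn containing two words.
a0, a1, a2 = Anchor("a0", 0.0), Anchor("a1", 0.4), Anchor("a2", 0.9)
graph = Graph()
graph.add_layer(Layer("turn"))
graph.add_layer(Layer("word", parent_id="turn"))
graph.add(Annotation("t1", "turn", "speaker1", a0, a2))
graph.add(Annotation("w2", "word", "ora", a1, a2, parent_id="t1"))
graph.add(Annotation("w1", "word", "kia", a0, a1, parent_id="t1"))
words = [w.label for w in graph.children("t1", "word")]
```

With the layer hierarchy declared up front, a generic routine can walk from turns to words without tool-specific knowledge, and an annotation whose parent is missing or sits on the wrong layer is rejected when it is added.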

The following section briefly

Conclusions and future work

Many of the corpus-sharing challenges posed by the diversity of annotation software and formats can be met by using a ‘pivot’ data model for annotation storage. Two-way conversions to tool-specific formats enable round-tripping of data through different tools without data loss, as annotations that are not representable within a given tool are nevertheless still represented in the central annotation store. Structure and alignment issues, ontological and granularity mis-matches, and convention

Acknowledgements

I would like to thank Florian Schiel and the two anonymous reviewers, who provided insightful comments and useful information. I would also like to thank Professor Jennifer Hay for her invaluable feedback while drafting this article, and her ongoing support and collaboration with the development of LaBB-CAT.

I would also like to acknowledge the LaBB-CAT user community for their ongoing feedback, collaboration and support.

References (65)

  • P. Boersma et al.

    Praat: Doing Phonetics by Computer [Computer Program]

    (2016)
  • J.W. Du Bois et al.

    Discourse Transcription

    (1992)
  • C. Fassnacht et al.

    Transana [Computer program]

    (2012)
  • H. Baayen et al.

    English Linguistic Guide for The CELEX Lexical Database

    (1995)
  • C. Barras et al.

    Transcriber [Computer program]

    (2008)
  • Bertin Technologies

    TranscriberAG [Computer program]

    (2011)
  • S. Bird

    NLTK: The Natural Language Toolkit

    Proceedings of the COLING/ACL on Interactive Presentation Sessions

    (2006)
  • S. Bird et al.

    Towards A Query Language for Annotation Graphs

    Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2002)

    (2002)
  • S. Bird et al.

    ATLAS: A flexible and extensible architecture for linguistic annotation

    Proceedings of the Second International Conference on Language Resources and Evaluation

    (2000)
  • S. Bird et al.

    A formal framework for linguistic annotation

    Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania

    (1999)
  • S. Bird et al.

    A formal framework for linguistic annotation (revised version)

    Speech Communication - Special issue on speech annotation and corpus tools archive

    (2001)
  • J. Blumtritt et al.

    Poio API and GraF-XML: a radical stand-off approach in language documentation and language typology

    Proceedings of the 2013 Balisage: The Markup Conference

    (2013)
  • BNC Consortium

    British National Corpus (BNC) XML Edition

    (2007)
  • L. Bombien et al.

    The EMU Speech Database System [Computer program]

    (2012)
  • L. Burnard et al.

    TEI P5: Guidelines for Electronic Text Encoding and Interchange

    (2012)
  • J. Carletta et al.

    The AMI meeting corpus: a pre-announcement

    Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction

    (2006)
  • S. Cassidy et al.

    The Alveo virtual laboratory: a web based repository API

    Proceedings of the Ninth Language Resources and Evaluation Conference (LREC 2014)

    (2014)
  • S. Cassidy et al.

    Multi-level annotation in the Emu speech database management system

    Speech Commun.

    (2001)
  • O. Christ et al.

    The IMS Corpus Workbench: Corpus Query Processor (CQP) User’s Manual

    (1994)
  • L. Clark et al.

    “kia ora. this is my earthquake story”. multiple applications of a sociolinguistic corpus

    Ampersand

    (2016)
  • M. Cochran et al.

    Report from TILR working group 1: tools interoperability and input/output formats

    Proceedings of the 2007 Working Group Report from the “Toward the Interoperability of Language Resources” Workshop

    (2007)
  • T. Declerck

    A framework for standardized syntactic annotation

    Proceedings of the 2008 International Conference on Language Resources and Evaluation (LREC 2008)

    (2008)
  • D. Roorda

    LAF-Fabric [Computer program]

    (2016)
  • C. Draxler et al.

    Speech processing tools – an introduction to interoperability

    Proceedings of INTERSPEECH 2011

    (2011)
  • S. Evert et al.

    Twenty-first century corpus workbench: updating a query architecture for the new millennium

    Proceedings of the 2011 Corpus Linguistics Conference

    (2011)
  • R. Fromont et al.

    ONZE Miner: the development of a browser-based research tool

    Corpora

    (2008)
  • R. Fromont et al.

    LaBB-CAT: an annotation store

    Proceedings of the 2012 Australasian Language Technology Association Workshop

    (2012)
  • M. van Gompel

    FoLiA: Format for Linguistic Annotation Document, version 1.1.1 Revision 4.5

    Technical Report LST-14-01

    (2016)
  • M. van Gompel et al.

    FoLiA: a practical XML format for linguistic annotation – a descriptive and comparative study

    Comput. Linguist. Neth. J.

    (2013)
  • Hamburg Centre for Language Corpora

    EXMARaLDA [Computer program]

    (2011)
  • S. Hawkings

    Introduction to Xwaves+

    (2008)
  • U. Heid et al.

    A corpus representation format for linguistic web services: the D-SPIN text corpus format and its relationship with ISO standards

    Proceedings of the 2010 International Conference on Language Resources and Evaluation (LREC 2010)

    (2010)