Building, Encoding, and Annotating a Corpus of Parliamentary Debates in TEI XML: A Cross-Linguistic Account

This paper introduces an integrative and comprehensive method for the linguistic annotation of parliamentary discourse. Initially conceived as documentation for a specific, small-scale research project, the annotation scheme takes into account national specificities and aims to be both highly standardized and adaptable to other research contexts. We present a specific application of the Text Encoding Initiative (TEI) framework to a subset of official transcripts of plenary proceedings in three parliamentary cultures. The TEI annotation scheme proposed here has two main applications: first, it serves as a basis for encoding parliamentary corpora by providing a systematic way of annotating both

specialized corpora, thus showing what can be annotated for specific research purposes and with limited means. Moreover, small-scale annotation schemes offer other advantages: for example, the possibility of encoding the variable majority/opposition, which had not been implemented prior to the annotation scheme presented here (Truan 2019, 45), as it needs to be done manually.
The variety of sources and formats is a strong point in favor of a common annotation framework.
All the texts have been retrieved from the official websites of the respective parliaments:
• http://hansard.parliament.uk/ for the House of Commons;
• http://pdok.bundestag.de/ for the German Bundestag;
• http://archives.assemblee-nationale.fr/ for the Assemblée nationale.

Both the British House of Commons and the French Assemblée nationale display the parliamentary proceedings in HTML, which allows for a quick, easy, and accurate retrieval of the content. The German corpus, on the other hand, is based on PDF files, which are noticeably less suited to further encoding and tagging. In this case, the files sometimes suffered from incorrect word breaks, which required minor corrections.
We carried out the process of encoding in TEI by combining manual and automatic processing workflows, with the idea of keeping both the content and the metadata of the sources. In particular, we used the GROBID software suite, which provides a relatively efficient transformation from PDF to a usable TEI format, although one that is not fully compliant with the target encoding scheme.
Attention was given to unifying the final format across the three languages and parliamentary settings so that the same phenomena and features would be encoded in exactly the same way for each sub-corpus.

Small Monolingual Corpora as the Basis for a Cross-Linguistic Perspective
The rationale behind the constitution of "small monolingual corpora" (Koester 2010) is to allow for the interaction between statistical measures and a close-reading analysis that is sensitive to the sociopolitical context in which parliamentary interaction takes place. In order to ensure that external variables that may shape parliamentary talk are assessed appropriately, the research project that forms the basis for the annotation scheme focused on a limited range of national debates concerning a major European Council meeting (see Truan 2021, chap. 4).

Despite their high degree of conventionality, parliamentary debates involve a wide range of different activities (or subgenres) such as ministerial statements, speeches, debates, oral/written questions, and Question Time (Ilie 2006, 191). In order to capture a wide array of speakers and to ensure thematic continuity, Naomi Truan selected one plenary debate per year held between 1998 and 2015 in the British, German, and French national parliaments, respectively, about a major European Council meeting (either before or after the meeting or on the same day). As Auel and Raunio (2014, 17)

While the annotation scheme described in this paper presents typical features of parliamentary interaction, it also represents a first step toward integrating contrastive perspectives while developing an annotation framework. The advantage of the comparison pertains to its heuristic value: by reflecting on similarities and differences during the annotation process, we come closer to an architecture that is valid and applicable to a large variety of linguistic data and metadata (see also Truan 2019 for methodological reflections on contrastive discourse analysis).

Preventing the Built-in Obsolescence of the Corpus
In this section, we outline the principles guiding the documentation of the corpus and show how the choices we made are intended to serve general purposes. We argue that annotating corpora cross-linguistically calls for a very flexible annotation framework that allows for multiple, expansible, and evolving annotations that may change over the course of time, a principle that is deeply rooted in the TEI. In order for this paper to be received outside the TEI community as well, we first briefly present the TEI Guidelines and show why they are deemed to be appropriate for parliamentary debates. We then link this general framework to what we call a sustainable corpus.

The TEI Annotation Scheme
The Text Encoding Initiative (see Romary 2008) has become, since its inception in 1987, the reference technical standard for the representation of textual content in the humanities. Based upon the W3C XML recommendation, it covers a wide range of genres and provides users with a vocabulary of nearly six hundred XML elements. At the core of the TEI Guidelines resides the principle that any TEI-based project should define its own subset (or customization) where the elements which are deemed useful for the representational task at hand are selected, documented, and possibly amended.

In the context of small specialized corpora, TEI annotation can be used to store the "detailed information about the speakers or writers" (Koester 2010, 72). Storing this information together with "the goals of the interactions or texts and the setting in which they were produced as part of the corpus database means that linguistic practices can easily be linked to specific contextual variables" (Koester 2010, 72). TEI XML annotation enables researchers to visualize the articulation between text and context, that is, between the plenary session and the metadata associated with it.
Interpretative data is situated within the corpus using dedicated TEI elements. As we will detail in section 6, the corpus is available under a CC BY 4.0 license, which enables anyone to correct or extend the metadata if necessary.

Based on this general understanding, we have conceived the annotation framework with a contrastive research question in mind: the subset we have devised consists of elements that are deemed equally valid for British, French, and German parliamentary debates. We argue that the cross-linguistic view enables us to take into account national specificities while "emphasiz[ing] what is common to every kind of document," as Burnard (2014, "The TEI and XML") highlights for TEI. In this sense, and despite the fact that the political context changes over time and differs between France, Germany, and the United Kingdom, TEI gives access to a common technical, practical, and methodological framework for the three subcorpora and the three languages.

A Sustainable Corpus
In designing the TEI-based encoding scheme of our corpus, we intended it to be easy for other scholars to take up for various types of research, and also to allow its possible extension (in terms of coverage) or enrichment (e.g., with additional annotated features). Although we would avoid the term reference corpus, which is more applicable to large-scale endeavors to build up a representative sample for a language (see, e.g., Kupietz et al. 2010), we strove to create a sustainable corpus that may be combined in time and space with other endeavors to describe language resources in a variety of contexts and for a variety of genres. Within this framework, we saw adopting a sampling strategy focused on our research question not as a restriction in the constitution of the corpus, but rather as a route to a better grasp of the parameters for the linguistic analysis and thus for the encoding.

The (internal) debate within the TEI community as to which module can optimally deal with parliamentary corpora, Drama or Transcription of Speech, relates to a more essential question: how should parliamentary debates be considered as a scholarly source? Three arguments plead, in our view, for an annotation as transcription of speech rather than drama. First, when designing the annotation scheme, we quickly settled on treating parliamentary debates as the tangible record of an observable interaction rather than a performance that could be derived from a preexisting script. Indeed, even if MPs may be reading from notes when participating in a parliamentary debate, seul le prononcé fait foi, that is, the transcription only records what is actually said.

Second, even if one could claim, following the theatrical metaphor, that MPs play a role, specifically depending on their relation to the government (majority, opposition) or their specific positioning on certain political issues, we also observe speakers as concrete entities to which we can associate, as we shall see, concrete personal and sociolinguistic markers in the context of a given political speech. Moreover, parliamentary debates display a wide range of phenomena pertaining to spoken (multimodal) interactions such as overlaps, interruptions, background noises, or applause, which may all be deemed to bear an interactional, if not political, meaning and thus cannot be equated with blocking, that is, indications pertaining to the staging of actors in order to facilitate the performance. Furthermore, MPs often depart from the script (at the British House of Commons, they are not allowed to read a text aloud). While the resemblance between parliamentary debates and theater is attested (Ilie 2003), there is always room for improvisation, unplanned reactions, interventions, or comments in parliament. It is true that some of these characteristics may not be transcribed by the official stenographers (see below for a discussion), yet they remain available.
Third, although some parliamentary records appear to be strongly edited and may be seen as very close to written prose or drama in style or structure, we think it would go against a general effort toward interoperability to adopt, for a subset of the general corpus of parliamentary records, an encoding strategy that would be different from what is needed for more fine-grained transcriptions. As a matter of fact, the TEI Guidelines' tagset for the transcription of spoken language does not imply that all details from the source must be encoded: with a very small subset of the corresponding elements, one can implement exactly what could be achieved with an encoding strategy based upon the tagset intended for drama.

For these three main reasons, we have adopted a TEI annotation scheme distinct from drama.

Enabling Sociolinguistic Explorations: The TEI Header
The criteria for documenting the corpus are directly derived from the model sketched out in the first two sections. In this section, we account for two levels of analysis underlying the annotation scheme: first, the TEI header (<teiHeader> element), which stores information related to "the metadata associated with the digital document itself, analogous to the title page of a printed book" (Burnard 2014, "The TEI Header"); second, the transcriptions of speech within the <text> element itself (for instance, the distribution of turns). Our documentation strategy has been determined by our fundamental decision within our corpus to fragment parliamentary debates into document units corresponding to plenary sessions, with the additional advantage of optimizing the maintenance of the corresponding information within our corpus at large (e.g., allowing other researchers to easily complement the corpus with additional sessions, as independent TEI documents), as well as facilitating cross-session analysis.

Political
Hence, each TEI XML document corresponds to one plenary debate as a communicative unit, that is, a given spatiotemporal unit bound to a specific situation in which a group of given participants discusses a given topic (Kerbrat-Orecchioni 1990, 216), thus making the text the proper linguistic object under investigation.

We have chosen to identify the speakers in each debate in the corresponding header and not in each utterance (or prior to each utterance) for three main reasons:

Usually, the <id> of a speaker corresponds to the last name. When speakers share their last names with another speaker of the corpus, as is the case here, the first names are added. Another option could have been to add the date of birth for each speaker.

The first group of features attached to the description of an MP within a plenary debate corresponds to stable (or very rarely varying) characteristics pertaining to the identification of the speakers according to long-term properties such as name (<persName>), sex (<sex>), and nationality (<nationality>). The content of <sex> allows for simple comparisons such as length of speeches by gender (see Truan 2021, chap. 4).

The second group of features is more specific to each plenary debate and corresponds to the political characteristics of the speakers: their political affiliation (<affiliation>), their relation to the current government (<floruit>, with values "majority" and "opposition"), and the district that elected them (<residence>).
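The two groups of features can be read back from a header with a few lines of standard-library Python. The following is a minimal sketch, not the corpus's actual markup: the speaker, party, and constituency are invented, and it assumes that each value is stored as element content inside a <person> element.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical header fragment: speaker, party, and district are invented.
PERSON_XML = """
<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="Smith">
    <persName>Jane Smith</persName>
    <sex>F</sex>
    <nationality>British</nationality>
    <affiliation>Labour</affiliation>
    <floruit>opposition</floruit>
    <residence>Exampleton</residence>
  </person>
</listPerson>
"""

FEATURES = ("persName", "sex", "nationality",
            "affiliation", "floruit", "residence")


def read_speakers(xml_text):
    """Return {speaker id: {feature name: text}} for every <person>."""
    root = ET.fromstring(xml_text)
    speakers = {}
    for person in root.findall(".//tei:person", TEI_NS):
        features = {}
        for name in FEATURES:
            value = person.findtext(f"tei:{name}", default="",
                                    namespaces=TEI_NS)
            features[name] = value.strip()
        speakers[person.get(XML_ID)] = features
    return speakers


speakers = read_speakers(PERSON_XML)
print(speakers["Smith"]["floruit"])
```

Keeping the stable and the debate-specific features in one record per speaker means that any of them can later serve as a grouping variable.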

This approach allowed us to look into the corpus through variables that have not, as far as we know, been consistently integrated into the corpus-based and corpus-driven analysis of parliamentary discourse so far. We are able to gain insights into the relationship between opposition and majority in terms of person reference that otherwise would have remained hidden. For instance, referring to certains (some), for a member of the UMP (Conservatives in France), is likely to denote the Communists at the Assemblée nationale (see Truan 2021, chap. 7). Building categories of discourse participants is closely intertwined with the speaker's construal of who is included and who is excluded. Such a finding could only be attained through the exploration of the correlation between linguistic forms and manually encoded variables in the form of TEI constructs.
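To make the link between linguistic forms and header variables concrete, the sketch below counts whitespace-separated tokens per side (majority or opposition). It is an illustration under stated assumptions rather than the project's workflow: the debate is invented, the header is reduced to the speaker list, and turns are assumed to be <u> elements pointing to speakers through a who attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented miniature debate; the header is reduced to the speaker list.
DEBATE_XML = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc><particDesc><listPerson>
      <person xml:id="A"><floruit>majority</floruit></person>
      <person xml:id="B"><floruit>opposition</floruit></person>
    </listPerson></particDesc></profileDesc>
  </teiHeader>
  <text><body>
    <u who="#A">We welcome the conclusions of the Council.</u>
    <u who="#B">Some would disagree.</u>
    <u who="#A">Some always do.</u>
  </body></text>
</TEI>
"""


def tokens_by_side(xml_text):
    """Count whitespace-separated tokens per value of <floruit>."""
    root = ET.fromstring(xml_text)
    # Map each speaker id to the side recorded in the header.
    side = {
        p.get(XML_ID): p.findtext("tei:floruit", default="",
                                  namespaces=TEI_NS)
        for p in root.iterfind(".//tei:person", TEI_NS)
    }
    counts = Counter()
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who", "").lstrip("#")
        text = "".join(u.itertext())
        counts[side.get(speaker, "unknown")] += len(text.split())
    return counts


counts = tokens_by_side(DEBATE_XML)
print(dict(counts))
```

The same join would work for any other header feature, for instance <sex> or <affiliation>, by changing the element looked up in the speaker record.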

Although we have not encountered this situation in our corpus, it should be pointed out that even in the last group of features, a change can happen within a given debate, when for instance an MP changes sides. Such a scenario occurred with the creation of The Independent Group (TIG) in February 2019. In such cases, the flexibility of the TEI toolkit would allow for a meaningful representation through the use of temporal attributes as exemplified in example 2.
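As a rough illustration of what temporal attributes make possible (a hypothetical sketch, not the authors' own example), the fragment below records two successive <affiliation> elements bounded by the TEI from and to attributes and looks up the one valid at a given moment. The speaker, parties, and timestamps are invented.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical speaker who changes party at 15:00 during a debate.
PERSON_XML = """
<person xmlns="http://www.tei-c.org/ns/1.0" xml:id="Doe">
  <persName>Alex Doe</persName>
  <affiliation to="2020-01-15T15:00:00">Party A</affiliation>
  <affiliation from="2020-01-15T15:00:00">Party B</affiliation>
</person>
"""


def affiliation_at(person, moment):
    """Return the affiliation valid at an ISO 8601 timestamp.

    Timestamps sharing one format sort correctly as plain strings,
    so no date parsing is needed for this sketch.
    """
    for aff in person.findall("tei:affiliation", TEI_NS):
        start = aff.get("from", "")   # open start: always in the past
        end = aff.get("to", "9999")   # open end: always in the future
        if start <= moment < end:
            return aff.text
    return None


person = ET.fromstring(PERSON_XML)
print(affiliation_at(person, "2020-01-15T14:30:00"))
print(affiliation_at(person, "2020-01-15T16:00:00"))
```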
Example 2. Exemplifying a change in political party within a plenary debate.

As shown previously, each parliamentary debate constitutes a specific speech event taking place at one time and one place. The speech event constitutes a macro frame in which speakers, who alternately become hearers as well, produce several turns. The contextual description of the speech event must thus contain the basic features that enable a user of the corpus to situate each utterance within a precise geo-temporal environment, but also to understand the broader political context.

The TEI Guidelines provide a suitable way to do so within the TEI header by using the <setting> element within a <settingDesc> element, whose usage we have adapted to match our purposes.
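A minimal sketch of how such a description might be read programmatically is given below. The child elements used (<name>, <date>, <locale>, <activity>) are ones the TEI Guidelines allow inside <setting>; which of them the corpus actually uses, and with what values, is an assumption here, and the sample values are invented.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical setting description; all values are invented.
SETTING_XML = """
<settingDesc xmlns="http://www.tei-c.org/ns/1.0">
  <setting>
    <name>House of Commons</name>
    <date when="2010-06-21"/>
    <locale>plenary chamber</locale>
    <activity>debate following a European Council meeting</activity>
  </setting>
</settingDesc>
"""


def read_setting(xml_text):
    """Return the place, date, locale, and activity of a speech event."""
    setting = ET.fromstring(xml_text).find("tei:setting", TEI_NS)
    date = setting.find("tei:date", TEI_NS)
    return {
        "name": setting.findtext("tei:name", default="",
                                 namespaces=TEI_NS),
        "when": date.get("when") if date is not None else None,
        "locale": setting.findtext("tei:locale", default="",
                                   namespaces=TEI_NS),
        "activity": setting.findtext("tei:activity", default="",
                                     namespaces=TEI_NS),
    }


setting_info = read_setting(SETTING_XML)
print(setting_info["name"], setting_info["when"])
```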
As illustrated in the example below, we have described the following features attached to a

In this section, we present the decisions pertaining to the turn level (section 5.1) as well as the intra-turn level (section 5.2). Importantly, we do not address other levels of annotation such as word-level annotation that could have been marked up in TEI as well. Indeed, for the purpose of this project, the results are based on automatic part-of-speech tagging. The open-source software TXM (Heiden, Magué, and Pincemin 2010) used in this project performs language-specific part-of-speech tagging when a corpus is imported.

Diwersy, Frontini, and Luxardo (2018) observe that the descriptor "speech type (debate, interruption, vote explanation, etc.)," which we do not use in our corpus annotation, proves to be "particularly important when it comes to differentiate effects of register variation ranging from highly formulaic to less formal speech (as in the case of e.g. interruptions)." The main reason for not annotating this level of analysis is, once again, to be found in the contrastive perspective we adopt. Whereas interruptions are thoroughly transcribed in the official records of the Bundestag and the Assemblée nationale, enabling new research questions on the special kind of dialogue emerging during these interactions, unexpected or unauthorized turns at the British parliament are only indicated as interruptions with no further information provided on the nature, source, or content of the disruption, as in the following example:

The Representation of Spoken
Europe put in place after the second world war, and I would include NATO as well as the European Union, have played a role in making sure that we settle our problems around conference tables rather than on the fields of Flanders. To that extent, yes, I think that it is right.

For the purpose of our corpus, we have not fully used the richness of the Transcriptions of Speech module of the TEI Guidelines, as described by Schmidt (2011). This is due to both the specific scope of the linguistic study that we were pursuing and the actual informational simplicity of the available sources. Still, our choice of this module offers the possibility of a variety of potential enrichments, either by ourselves or by anyone who would want to further complement the corpus. The possibility of precisely aligning, by means of a timeline, the various turns, sub-segments, or any kind of incident offers the potential for better insight into the nature of the interactions carried out in parliamentary contexts, from a prosodic or gestural point of view for instance.
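The kind of enrichment a timeline allows can be sketched as follows. This is a hypothetical illustration, since the corpus itself carries no time alignment: it assumes a TEI <timeline> whose <when> points are given in seconds, and turns and incidents that refer to those points through start and end attributes. All identifiers and values are invented.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragment: one turn followed by an untranscribed interruption.
ALIGNED_XML = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <timeline unit="s" origin="#T0">
    <when xml:id="T0"/>
    <when xml:id="T1" interval="12.5" since="#T0"/>
    <when xml:id="T2" interval="15.0" since="#T0"/>
  </timeline>
  <u who="#A" start="#T0" end="#T1">To that extent, yes.</u>
  <incident start="#T1" end="#T2"><desc>Interruption</desc></incident>
</body>
"""


def durations(xml_text):
    """Return (element name, duration in seconds) for aligned elements."""
    root = ET.fromstring(xml_text)
    offset = {}
    # Resolve each <when> to an offset from the origin; points are
    # assumed to be declared after the point they refer to.
    for when in root.iterfind(".//tei:when", TEI_NS):
        base = offset.get(when.get("since", "").lstrip("#"), 0.0)
        offset[when.get(XML_ID)] = base + float(when.get("interval", "0"))
    result = []
    for el in root:
        if "start" in el.attrib and "end" in el.attrib:
            name = el.tag.split("}")[1]
            start = offset[el.get("start").lstrip("#")]
            end = offset[el.get("end").lstrip("#")]
            result.append((name, end - start))
    return result


aligned = durations(ALIGNED_XML)
print(aligned)
```

With such offsets in place, the length of interruptions or the pace of turn-taking could be compared across the three parliaments, provided the sources supply timing information.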

Documenting and Archiving the Data
The Text Encoding Initiative has been, right from the outset, the basis for a strong open science vision, where interoperability would be at the service of sharing and reusing digital content encoded according to the TEI Guidelines (for an overview, see Romary 2020). For this reason we provide here an overview of our efforts to make the corpus FAIR ("Findable, Accessible, Interoperable, and Reusable"; Wilkinson et al. 2016).

As already alluded to, the corpus has been designed with the idea that it could be easily reused

7. finally, the ability to add an XSLT stylesheet to the corpus to provide a default search and presentation environment (in HTML).

Beyond the technical setting, we conclude with dissemination issues that, in our view, are an essential part of the annotation project. First, we considered that beyond seeing the corpus as reusable (linguistic) content, presenting the annotation framework as an ongoing process could also serve as a methodological point of comparison for other comparable endeavors. As a consequence, we decided to distribute all the source documents rather than limiting access through, for example, a query interface, as is the case for the EuroParl corpus. Second, although there are often fears of being plundered when data is disseminated at too early a stage in a research process, the author who compiled the corpus as part of her dissertation project decided to put the data online even before the actual doctoral publication was available.

The three corpora are available online at the following addresses:

Conclusion
This paper has suggested an integrative and comprehensive approach to the linguistic annotation of parliamentary discourse that takes into account national specificities and is specifically geared to proposing an annotation scheme that is both highly standardized and adaptable. The method is based on the TEI framework. We have argued that the linguistic features of parliamentary interaction call for an annotation scheme distinct from the ways theatrical plays have been accounted for within the TEI community. We have also pleaded for an easily reproducible cross-linguistic annotation framework. Specifically, we have shown that including metadata such as political affiliation or the distinction between majority and opposition is crucial to allowing for the comparison between several parliamentary systems.
We understand this paper as a first step toward the annotation of parliamentary corpora on a larger scale. We recognize that the small size of the corpora (from approximately 137,000 tokens for the French corpus to 417,000 for the German corpus) allowed for fine-grained annotation that may be more difficult to implement on a larger scale. Accordingly, the application of this annotation scheme to a bigger corpus needs to be systematized. On the other hand, it would also be possible to further complement the detailed annotation scheme, for instance by providing timestamps and hyperlinks to the videos, as suggested by Cribb and Rochford (2018, 13), "so that a user at a particular point in the report can link through to the audio recording effortlessly and accurately." A more precise linkage between the videos and the transcripts could also enable insightful annotation in terms of kinesics, a dimension which, arguably, would adequately complete a close-reading discourse-analytic endeavor.

These further extensions and exploitations of the annotated corpora are at the core of our understanding of annotation as a process rather than a finished product (see also Bucholtz 2000 for a similarly reasoned argument in terms of "the politics of transcription"). In doing science in the digital age it is essential to make decisions explicit, transparent, and replicable. The annotation scheme developed in this project is only a first step.
open-source, as it would be through such a platform as GitHub. We see GitHub, which is a private platform and thus does not fulfill all our criteria of a sustainable environment, as a possible front end for the further development of such a corpus as ours, while keeping an environment such as Ortolang as the final publication setting.