Why TEI Stand-off Markup Authoring Needs Simplification

Stand-o techniques are well documented in the TEI Guidelines and their use is encouraged. They are not widely adopted, however, except in their simplest forms, because they introduce considerable managerial overhead to a project’s workow. This article argues that stand-o markup is becoming increasingly relevant to a wider range of TEI projects, partly due to recent developments in the theory of textual modeling that strongly acknowledge the multidimensional nature of text. As a consequence, there is a need to considerably simplify the authoring, management, and processing of stand-o markup; but to what extent is this possible? Can tools make stand-o techniques more approachable without substantially disrupting workows familiar to TEI encoders? This article focuses on the complexities of stand-o markup authoring through

finally, (3) the TEI provides a catchall stand-off structure, the "Feature Structures" (<fs>) defined in chapter 18, for annotating any kind of textual feature (often linguistic) that is not represented by other TEI means. 2

Beyond these already established, fairly domain-specific TEI patterns, the need for stand-off markup is going to become gradually more pressing in a wider range of TEI projects, because it provides a technical answer to recent developments in the theory of textual modeling that have strongly acknowledged the multifaceted nature of text. Patrick Sahle's theory of pluralistic text (Sahle 2013) notably claims that text can be viewed in different ways, for example as an idea, a work, a linguistic expression, as well as a sign and a material document. 3 Elena Pierazzo has proposed a multidimensional model of text, document, and work where each dimension is the function of a reader's selection of observable features and the context in which they are found (Pierazzo 2015, 40-54). Peter Robinson, as a reaction to a recent proliferation of digital editorial projects focusing on single sources, has remarked that editions should "illuminate both aspects of the text, both text-as-work and text-as-document" (Robinson 2013, 123). Invoking these positions, which one can hardly hope to summarize exhaustively in a few sentences, serves as a reminder that scholarly text encoding is determined by fundamental decisions about one's idea of text, and that competing views will emerge and require encoding. Indeed, the recent push for a multidimensional modeling of text is already generating projects that attempt to integrate competing representations of the same text.

The TEI has recently adopted a set of elements in direct competition with the OHCO-inspired representation of units of writing.
The Guidelines chapter "Representation of Primary Sources" was substantially expanded in TEI P5 version 2.0.0 (in December 2011), following the proposal of a working group focused on genetic editing. 4 The resulting new elements and encoding strategies allow for a document-focused form of encoding, where concepts of page, opening, zone of writing, and lines are of foremost importance. These structures are useful to formalize an encoder's understanding of how text takes form on a writing surface, as well as to track an author's actions on the page to identify textual revisions and the order in which they occurred. This has already sparked a debate on how to coordinate document-focused encoding with the more traditional (in TEI terms) text-focused encoding, often using text duplication or stand-off markup (see Brüning, Henzel, and Pravida 2013 and Muñoz and Viglianti 2015). 5

The two case studies that follow will exemplify this tension between competing representations of text and how some balance was achieved with stand-off markup. The first case study describes the digital edition of the libretto of Carl Maria von Weber's opera Der Freischütz in the context of the now-completed Freischütz Digital project. The need for a stand-off critical apparatus of variants emerged because the encoding of each source included typographical and orthographical detail that made the use of embedded apparatus entries (also known as parallel segmentation) impossible. The second case is a report of work in progress at the Shelley-Godwin Archive (S-GA), an early adopter of the new document-focused encoding strategies for the representation of primary sources. In the S-GA, units of writing were initially encoded with milestones (empty elements between the units), but this course of action has proven to be unsatisfactory and too complex to maintain, and the project is currently experimenting with remote stand-off markup.
Both projects described make use of coreBuilder, a web application originally designed for the Freischütz Digital project to facilitate the authoring of stand-off markup in a visual environment. The tool will be presented as an example (as opposed to an outright solution) of making the application of stand-off techniques more approachable in the context of projects dealing with multidimensional representations of text, without substantially disrupting workflows already familiar to TEI encoders.

Freischütz Digital (FreiDi) was a project funded by the Bundesministerium für Bildung und Forschung (Federal Ministry for Education and Research, BMBF) to create a digital critical edition of Carl Maria von Weber's romantic opera Der Freischütz. 7 The project set ambitious goals, including the digitization of all manuscript sources for both the score and the libretto, a new recording, and experimental audio-score alignment features. As a general principle for the digital edition, encoded diplomatic transcriptions were to be provided for all sources. For the musical ones, encoded using the Music Encoding Initiative (MEI) format, a diplomatic transcription meant dealing with source-specific aspects of: (1) textual components of music notation, such as performance instructions, dynamics, tempo marks, and lyrics; and (2) music notation symbols and conventions, including the use of brackets, tuplets, and articulation signs. The sources of the libretto, written by Friedrich Kind with some substantial interventions by Weber (see Weber 2007), were encoded in TEI "with a focus on the dramatic and lyrical structure of the text: the content is organized by scenes and verses; prose texts are distinguished and attributed to each actor; and stage directions and other descriptive text are identified as such. The transcriptions preserve original spelling and emphasis such as italicized and underlined text. Revised, deleted, and added passages are identified and marked up" (Viglianti 2016, 735). Since all sources were transcribed and encoded independently, creating a collation of variants required remote stand-off techniques. 8

FreiDi's approach to sources implemented a pluralistic view of text: the digital model needed to account for both the genetic and codicological dimensions of source-specific details, as well as textual variance and, more generally, the work.

The "parallel segmentation" method for encoding variants described in the TEI Guidelines (chapter 12) could not be applied to the TEI-encoded libretto sources of FreiDi because, according to this method, textual variants are encoded directly in the text at phrase level. The TEI also offers a stand-off method for encoding variants, called "double-endpoint attachment," in which variants can be encoded away from the base text by specifying the starting and ending points of the lemma of which they are a variant. This allows encoders to refer to overlapping areas on the base text.

While more exible, this method was not ideal for FreiDi's model because it assumes the existence of a base text, in terms of which the editor records variants from other sources; this makes it impossible to preserve textual aspects of the various sources such as spelling, abbreviations, or emphasis without identifying these aspects as substantial variants themselves. For example, if the character name Agathe were underlined in one source and not in another, one would have to create a new apparatus entry to record such a dierence: <speaker> <app> <rdg wit="#W1">Agathe</rdg> <rdg wit="#W2"><hi rend="underline">Agathe</hi></rdg> The solution for FreiDi's libretto edition involved the creation of a separate apparatus le that encodes textual variance with <rdg> elements containing pointers to markup in the encoding of the sources. This approach has a lot in common with collation les generated after an alignment step in software such as Juxta and CollateX, 9 but it is designed to operate at more than one level of tokenization, so that statements about variation can be attached to any element in the TEIencoded sources. To briey illustrate this model, let us consider the following verses from two libretto sources and the corresponding apparatus le entry. 10 Source KA-tx15.xml Source A-pt.xml <l xml:id="KA-tx15_l1">Sie erquicke, </l> <l xml:id="KA-tx15_l2">Und <w xml:id="KA-tx15_w1">bestricke</w> </l> <l xml:id="KA-tx15_l3">Und beglücke,</l> <l xml:id="A-pt_l1">Sie erquicke, </l> <l xml:id="A-pt_l2">und beglükke </l> <l xml:id="A-pt_l3">und <w xml:id="A-pt_w1">bestrikke.

The editorial team thought it necessary to also represent, as much as possible, units of writing more typical of text-focused TEI encoding, both to provide users of the S-GA with a less noisy "reading view" and to facilitate interchange with other TEI documents. 11 As in the FreiDi example above, there is a tension between different and competing views of the text that need to be reconciled. In this case, too, stand-off offers a way forward.

Units of writing in S-GA are currently encoded as a secondary hierarchy using <milestone> and <anchor> elements that mark the beginning and the end of the unit. The @unit attribute is used to indicate the corresponding TEI element:

<milestone unit="tei:p" spanTo="#ap1"/> <!--document-focused encoding -->

The solution that is currently being developed promotes the text-focused secondary hierarchy back into the <text> element. Instead of duplicating text content, however, stand-off references are used to include content from the document-focused encoding, which remains the primary form of encoding, as shown in the following example. This approach will simplify the encoding of more complex textual structures without overcrowding the document-focused hierarchy; validation will also be simpler and more effective.

Arguably, the document-focused encoding could similarly include textual content with stand-off elements from a "plain" text document containing just strings of characters. This approach has been suggested in a number of publications (e.g., Schmidt and Colomb 2009), but we are deliberately avoiding it because we argue that plain text is not sufficiently devoid of interpretation or markup. Rather, we designate our document-focused encoding as the primary encoding that other representations can target with stand-off elements.

While this approach is more expressive and potentially easier to use than the previous milestone-based model, creating and managing the pointers introduces complications typical of stand-off encoding.
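For comparison, the earlier milestone-based model can be sketched as follows. The Python below is only an illustration under stated assumptions: the `<zone>` and `<line>` sample is invented, the TEI namespace is omitted, and the function is not part of the S-GA code base. It collects the text that falls between a <milestone> carrying @spanTo and the <anchor> it points to, which is the work any processor has to do to recover a unit of writing from this secondary hierarchy.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented document-focused fragment: the paragraph starts in the middle of
# one line and ends in the middle of another.
SAMPLE = """<zone>
  <line>last words of one paragraph.</line>
  <line><milestone unit="tei:p" spanTo="#ap1"/>A new paragraph begins</line>
  <line>and ends here.<anchor xml:id="ap1"/> Next unit.</line>
</zone>"""


def walk(el):
    """Yield elements and text chunks in document order."""
    yield el
    if el.text:
        yield el.text
    for child in el:
        yield from walk(child)
        if child.tail:
            yield child.tail


def milestone_spans(xml_text):
    """Return (unit, text) pairs for every milestone/anchor span."""
    spans = []
    open_span = None  # (unit, anchor id, collected text chunks)
    for item in walk(ET.fromstring(xml_text)):
        if isinstance(item, str):
            if open_span:
                open_span[2].append(item)
        elif item.tag == "milestone" and item.get("spanTo"):
            open_span = (item.get("unit"), item.get("spanTo").lstrip("#"), [])
        elif item.tag == "anchor" and open_span and item.get(XML_ID) == open_span[1]:
            # Normalize the whitespace introduced by line breaks and indentation.
            text = " ".join("".join(open_span[2]).split())
            spans.append((open_span[0], text))
            open_span = None
    return spans


print(milestone_spans(SAMPLE))
```

The sketch handles one open span at a time; nested or overlapping units would need a stack of open spans, which hints at why the milestone model becomes hard to maintain as structures grow more complex.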

Facilitating Stand-off Markup Creation with coreBuilder
These two case studies show that stand-off techniques can be effective for encoding text according to a pluralistic, multidimensional modeling approach. They also highlight at least two challenges for the manual creation of stand-off markup: (1) entering pointers as string identifiers is error-prone; and (2) copy-pasting identifiers can be a slow process, particularly when there are multiple documents involved, as exemplified by the libretto edition in FreiDi. Good authoring tools may help produce solid stand-off markup, but tools that are customizable, non-domain-specific, or that offer TEI support are not readily available. 12
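The first challenge can be made concrete with a small check for pointers that do not resolve. The following Python is a minimal sketch, assuming that pointers are written as `filename#id` in a @target attribute; the sample stand-off entry and its mistyped identifier are invented for illustration, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ids_in(xml_text):
    """Return the set of xml:id values declared in one source document."""
    return {el.get(XML_ID) for el in ET.fromstring(xml_text).iter() if el.get(XML_ID)}


def dangling_pointers(core_xml, sources):
    """List every pointer in the stand-off file that does not resolve."""
    known = {name: ids_in(text) for name, text in sources.items()}
    missing = []
    for el in ET.fromstring(core_xml).iter():
        # @target may hold several whitespace-separated pointers.
        for pointer in el.get("target", "").split():
            filename, _, ident = pointer.partition("#")
            if ident not in known.get(filename, set()):
                missing.append(pointer)
    return missing


SOURCES = {
    "KA-tx15.xml": '<l xml:id="KA-tx15_l2">Und <w xml:id="KA-tx15_w1">bestricke</w> </l>',
}

# Invented entry: the second pointer contains a typing mistake.
CORE = """<app>
  <rdg><ptr target="KA-tx15.xml#KA-tx15_w1"/></rdg>
  <rdg><ptr target="KA-tx15.xml#KA-tx15_w11"/></rdg>
</app>"""

print(dangling_pointers(CORE, SOURCES))
```

A check of this kind catches mistyped identifiers after the fact, but it cannot tell whether a pointer that does resolve targets the element the encoder intended.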

The coreBuilder tool was developed to address this gap. 13 Originally developed for FreiDi's libretto edition (Viglianti 2016, 738-39), the tool is a web application for creating stand-off markup in a visual environment. Encoders can open XML files from their local machine or from the web, 14 set the stand-off elements and attributes that the tool should create, and finally click on XML elements with identifiers (@xml:id) to create stand-off entries. The tool automatically creates the links using the identifiers, thus reducing human error. The visual interface is designed to eliminate the need for scrolling between different parts of the same file, or moving between different windows while locating target elements and creating the stand-off elements. The stand-off entries are stored in an XML file (or "core") that can be downloaded and integrated into a TEI encoding. Figure 1 shows the use of coreBuilder for creating a stand-off apparatus entry. There are two TEI files opened in the tool, each encoding a different source document of the same work. Users can create stand-off entries by clicking on elements with @xml:id attributes, or by selecting text to create XPointer expressions. 15 The apparatus entry created in the example identifies a variant with two readings that invert the words "bestricke" and "erquicke" in two consecutive verses. The remote stand-off approach allows encoders to preserve differences in punctuation between the two sources without encroaching on the critical apparatus.

Figure 2. A list of stand-off entries ready to be downloaded as XML.
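The idea of generating links from selected identifiers, rather than typing them, can be sketched as follows. This Python is not coreBuilder's code (the tool is a web application) and does not reproduce its output format: the function `build_entry`, its parameters, and the `filename#id` pointer syntax are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def build_entry(selections, wrapper="app", grouping="rdg", pointer="ptr"):
    """Build one stand-off entry from the identifiers a user has selected.

    `selections` maps a witness name to a list of (filename, xml_id) pairs.
    The element names are configurable, mirroring the idea that users choose
    which stand-off elements should be created.
    """
    entry = ET.Element(wrapper)
    for witness, targets in selections.items():
        group = ET.SubElement(entry, grouping, {"wit": "#" + witness})
        for filename, xml_id in targets:
            # The pointer string is assembled by the program, never typed.
            ET.SubElement(group, pointer, {"target": f"{filename}#{xml_id}"})
    return entry


entry = build_entry({
    "KA-tx15": [("KA-tx15.xml", "KA-tx15_w1")],
    "A-pt": [("A-pt.xml", "A-pt_w1")],
})
print(ET.tostring(entry, encoding="unicode"))
```

Since every pointer is derived from an identifier that already exists in an opened document, typing mistakes of the kind discussed earlier cannot occur at creation time.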

The earliest version of coreBuilder that was used for FreiDi relied exclusively on element identifiers: FreiDi encoders, therefore, were also tasked with introducing elements such as <seg> and <w> to allow the stand-off apparatus to refer to the text at the right point (see the example described in section 2). The latest version of the tool also supports the creation of XPointer expressions by highlighting and selecting ranges of text directly on the XML document. These make it possible to refer to arbitrary ranges of text in a TEI document, as defined in chapter 16 of the Guidelines. The XPointer functionality is adapted from tei-xpointer.js, 16 a JavaScript library by Hugh Cayless, who also contributed to a general overhaul of the TEI XPointer scheme definition (see Cayless 2013).

Using XPointer avoids cluttering TEI documents with unnecessary markup and provides a great level of flexibility and granularity. This approach has one important drawback, however: any small change in the target encoding will likely "break" the XPointer reference. Consider, for example, correcting the typo in the word "exmple" to "example" and consider also that this text is marked up with a <w> element that is targeted by stand-off markup. Correcting the typo would not require an update of the stand-off entry because it targets the <w> element as a whole. If the target were expressed with an XPointer string range, however, the extra "a" would cause a shift in the character count and introduce an error in the stand-off element that could very well go unnoticed. Practically speaking, even though a tool such as coreBuilder is able to simplify the creation of stand-off markup, the managerial overhead remains somewhat problematic, particularly when references involve XPointer expressions with string ranges that cross hierarchies.
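The difference between the two kinds of reference can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: `by_offsets` only imitates a character range with an offset and a length counted over the text of a paragraph, and does not implement the TEI XPointer scheme; the sample sentence is invented around the "exmple" typo discussed above.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

BEFORE = '<p>Consider this <w xml:id="w1">exmple</w> of a typo.</p>'
AFTER = '<p>Consider this <w xml:id="w1">example</w> of a typo.</p>'


def by_id(xml_text, ident):
    """Resolve a reference that targets a whole element by its xml:id."""
    for el in ET.fromstring(xml_text).iter():
        if el.get(XML_ID) == ident:
            return "".join(el.itertext())
    return None


def by_offsets(xml_text, offset, length):
    """Resolve a reference expressed as a character range over the text."""
    text = "".join(ET.fromstring(xml_text).itertext())
    return text[offset:offset + length]


# Range recorded against the uncorrected text: the six characters of "exmple".
OFFSET = "Consider this exmple of a typo.".index("exmple")
LENGTH = len("exmple")

for label, doc in (("before", BEFORE), ("after", AFTER)):
    print(label, repr(by_id(doc, "w1")), repr(by_offsets(doc, OFFSET, LENGTH)))
```

After the correction, the identifier-based reference still returns the whole word, while the range silently returns a truncated string; nothing in the stand-off entry itself signals that it is now wrong.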
The coreBuilder tool may not be a complete solution to all the issues related to stand-off markup, but hopefully it can serve as an example of an application for simplifying stand-off markup authoring. 17 The tool provides the following features that we argue are essential for a stand-off markup tool to be useful and effective in a workflow where TEI is created by hand.
• It provides a visual environment to create references to XML identifiers by pointing and clicking (or equivalent touch-enabled operations).
• Users are able to set the stand-off elements that they need to create (see figure 3).
• The XML is not hidden from the user; the tool strives to be useful to TEI encoders at all levels, including experts who may find working with hidden or rendered XML a complication rather than an aid.
• Users can create TEI XPointer expressions instead of ID references when required.

It is also tempting to argue for a deep integration with the web and a browser-based

Whether browser-based, or part of a desktop application, the complexities surrounding the use of stand-off markup in a TEI project will not be solved by one simple tool; parsing and processing stand-off markup are also less straightforward than other operations in XML. The main obstacle to greater adoption, however, is likely to be authoring because, unlike other steps, it needs to be done by hand, which is still the expected way of authoring documents in many TEI projects. Solving the problem of creating stand-off markup may simplify and aid projects that already make use of stand-off markup for analysis and annotation, but more importantly it has the potential to lead to a new wave of experimentation in TEI encoding. Modeling texts according to a pluralistic, multidimensional ontology of text is already largely possible with TEI, particularly after the expansion of the "Representation of Primary Sources" chapter and the introduction of document-focused elements. The limitations on doing so at this point are more practical than conceptual.
2 Some inline elements also use pointing mechanisms to connect the starting and ending points of a textual feature. The elements <addSpan> and <delSpan>, for example, can be used to identify textual revisions across units of writing such as a paragraph. Unlike stand-off markup, however, these elements cannot be extrapolated from the text flow because their placement indicates the starting and ending point of the encoded text.
3 Sahle organizes these different notions of text as spokes on a wheel, which effectively positions them as supporting components of a whole.