1 Introduction

The approach to scholarly editing used by historians differs from that used by literary scholars. Both share an interest in a good text established by textual criticism, as texts are the main sources on which historians draw when constructing narratives about the past. Nevertheless, historians can take a slightly different approach to text: linguistic and physical aspects are considered mere intermediaries for the information the text conveys. Historians consider the content of the text ‘data’, and they want to use this data in their research to gain knowledge about the past. The circumstances under which archival documentation, a major type of text with which historians work, was created support this perception of text: people recorded administrative activities in writing in order to preserve information about these activities for contemporary but absent clerks, or for future ones. In other words, they stored data in texts written on paper.

In pre-digital editorial practice this can lead to decisions which are unacceptable to literary scholars, such as paraphrasing parts of the text. I will try to show that in digital scholarly editing the historians’ approach can be reconciled with the methods of textual scholarship. I suggest calling this combined method ‘assertive editing’ to avoid the impression that it can only be used by historians. The method of assertive editing is defined not by disciplinary interests but by an interest in one facet of text: the information recorded. In terms of Patrick Sahle’s text wheel (Sahle 2013:III,45-49), the assertive edition is the editorial practice dedicated to the ‘text as content’ perspective. In the following I will usually contrast this ‘content’ with the ‘text’ as pure transcription and the result of text-critical work.

2 Contributions to the assertive edition

Assertive editing is fed by two streams in pre-digital and early digital scholarship. The ideas of content-oriented navigation, multiple forms of representation, and extensive historical commentary are drawn from pre-digital editorial practice in historical research. I will try to show this by presenting three major German printed editorial series in historical research: the Monumenta Germaniae Historica (MGH), the Records of the Early Modern Imperial Diet (“Reichstagsakten”), and the Official Minutes from the Imperial Chancellery (“Akten der Reichskanzlei”: Bundesarchiv 1982).

2.1 Pre-digital contributions

To facilitate navigation and reception, the editions in the MGH Diplomata series prepend an abstract of the legal core to each document. This is common practice in European charter editions, and it was codified by a committee of historical editors under the direction of Robert-Henri Bautier in 1974 (Bautier 1976:13, 17). More recent Diplomata editions dedicate a paragraph in the introduction to each document to its historical context (e.g. the charters of Emperor Frederick II, Koch 2002-2017). Some MGH editions of historiographical texts indicate in the margins the year to which the current passage refers (e.g. Georg Waitz’s edition of the Historia Danorum Roskildensi, 1892:21-26). This helps readers find the events in which they are interested.

Of course, abstracts serve more purposes than simple navigation. In the editions of the records of the Imperial Diets, abstracts replace some of the documents entirely (e.g. Heil 2014, 87-91). Dietmar Heil describes the editors’ interest: “The priority is … philological authenticity, but optimal accessibility” (2015, 29, trans. Georg Vogeler). The editors reduce historical orthography and change punctuation when it deviates from a modern syntactical analysis of the text (Heil 2015, 29-31). Editors of correspondence have also considered this approach (Steinecke 1982).

Editorial work in contemporary history is defined by the selection of significant material and the contextualization of the text. The editors of the Minutes of the Cabinet of the German Federal Government (Bundesarchiv 1982), for instance, explain their selection by the relevance of the content, discarding as irrelevant, for example, the agenda at the head of each set of minutes, invitations, and their attachments. This content-oriented approach can be found in other editorial principles of this edition: orthographic and syntactic errors, for instance, are emended without notice. Single entries start with a heading, the persons present, and the place and time of the meeting, not as a verbatim copy but as an extract created by the editors.

The Minutes also serve as an example of the third element of pre-digital editorial practice: extensive notes on the subjects of the meetings are added to each transcript with the aim of making the texts understandable. This kind of annotation is not specific to this one edition but is generally recommended in historical editing (Cullen 1981; Stevens and Burg 1997, 157). The edition of the Minutes of the Bundeskabinett serves primarily to illuminate government decisions rather than their wording. Similarly, many MGH editors add extensive comments on the historical context, e.g. in the pre-publication of the anonymous continuations of Frutolf for 1101-1106 (Marxreiter 2018).

These approaches have been directly transferred into electronic editions. The idea of facilitating the understanding of the text admits translation as a way of editing. This leads to solutions like David Postles’ online representation of the medieval records of Stubbington (Postles 2011), which gives the text in translation from the original Latin. This is not an isolated practice, as P.D.A. Harvey shows in his introduction to translation as a method in historical editing (2001, 31-32). From a historian’s point of view, a translation is a sensible solution, as it facilitates the use of the document. It would not, however, satisfy the research interests of textual scholars.

Paul D.A. Harvey argues that the edition of historical records can be reduced to a calendar of abstracts when the original or photocopies of the records are easily accessible (2001, 56-59). Several projects follow an approach of this kind. The Records of the Swiss Foreign Office (Zala et al. 1978–2018) replace the transcription with images. This calendar-plus-image approach is also used by the Sound Toll Registers project (Veluwenkamp and van der Woude 2009; Gøbel 2010) and by Peter Rauscher and his colleagues in the Donauhandel project (2008-2018). Both create databases with structured information taken directly from the source and link it to images of the source.

2.2 Early digital contributions

Historians’ interest in the ‘facts’ and the dominance of sociological approaches to history from the 1960s to the 1980s led them to create ‘databases’ of historical information (Boonstra et al. 2004). A famous example of this approach is the Online Catasto of 1427 (Herlihy et al. 2002), an online edition created by R. Burr Litchfield and Anthony Molho based upon David Herlihy and Christiane Klapisch-Zuber’s project Census and Property Survey of Florentine Dominions in the Province of Tuscany, 1427-1480 (1978; Herlihy 1964; Herlihy 1967). The data keeps close to the source, copying the information on wealth recorded for each taxable household in the city (as found in the initial tax declarations of 1427, plus additions and adjustments made in 1428 and 1429). Seeing historical records as a contingent medial solution for preserving and processing information, one could consider this database a simple change in recording medium, not in the information itself. Yet the needs of the new medium required substantial changes in the recording method: Herlihy and Klapisch-Zuber had to create new encodings and had to break the text rigorously into table columns. In the end, the database tries to recreate the information recorded by the Florentine officials, answering three essential questions: who had to pay what amount of taxes for which kind of property?

Philological editors certainly cannot consider this database an edition. The encoders did not copy family names, first names, and patronymics letter by letter, but standardized them and truncated them when they exceeded ten letters. Historians were well aware of the modifications that database encoding made to the original records. In the 1970s, however, digital scholarly editing was not yet developed enough to provide a solution. The concept of scholarly editing does not even appear in the more recent book on Historical Information Science by Lawrence McCrank (2002). At the time, computing methods in the historical sciences chiefly meant the production of relational databases and spreadsheets.

In the 1980s, Manfred Thaller proposed a historical database system that kept closer to the original source (1980, 1988, 1992, 1993). He developed the Clio database system as a ‘source-oriented’ database, reducing the amount of encoding and transformation of the source that was then customary. Clio kept as much information from the source as possible by allowing for hierarchical organization of information, better representation of incomplete data, and the integration of alternatives and comments. This source-oriented database approach is clearly a type of editorial work, combining text from the source with interpretation by and for historians. At the same time, a philologist would regret the lack of a full transcription.

3 Digital editions and facts

Digital scholarly editing has developed since the days of Clio and has built upon the methods developed for the MGH, the Reichstagsakten, and the Akten der Reichskanzlei. The assertive edition developing out of these strands lies somewhere between pure textual representations and well-formed databases structured around specific research questions. No edition yet calls itself an assertive edition, but many bear features that fit the definition put forward here. A selection may be found by searching Patrick Sahle’s catalogue for “general subject area: history” (Sahle 2008–2017). Browsing through the projects on the list, one can identify four major questions:

  1. Which interface elements are typical for an assertive edition?

  2. How can we use automatic information extraction processes in the scholarly edition?

  3. Is semantic markup (provided by the TEI) sufficient?

  4. How can we integrate the Web of Data (the ‘Semantic Web’) into scholarly editions?

3.1 Interface elements

Editions like the letters of Alfred Escher (Jung 2012–2018), the Acta Pacis Westphalicae (Lanzinner and Braun 2014), and the Diplomatic Correspondence of Thomas Bodley 1585-1597 (Adams 2011) offer avenues of access to the text beyond the pre-existing textual structure. Typically, these tools include indices of persons, places, and subject keywords. Other entry points to the texts show better what an assertive scholarly edition would concern itself with: the APW, for instance, gives access via a timeline of events, a calendar of relevant dates, and a map. Indeed, indices of persons, places, and events, as well as calendars and maps, are fast becoming default components of historical digital editions. Additional fact-oriented interface elements seem to depend more on the type of documents edited: rich prosopographical information, as in correspondence, suggests network visualisations, used for instance in the diplomatic correspondence of Thomas Bodley (Adams 2011, visualisations); economic information suggests bar charts to visualize income and expenditure, as in the edition of the municipal accounts of Basel 1535-1611 (Burghartz 2015, Konten). The latter builds upon the source-oriented database approach advocated by Manfred Thaller by allowing the user to select entries from the accounts and collect them in a ‘data basket’ (Burghartz 2015, databasket), on which basic arithmetic operations can be performed and whose results can be downloaded as a spreadsheet. Finally, semantic networks like those used in Burckhardt Source (Ghelardi et al. 2015) hold some general promise, but for the moment they remain isolated solutions for single projects.

3.2 Information extraction

The user interfaces, of course, are only the surface of the edition. How does one harvest information? What form does the information take as digital data? Which models relate the information to the transcription? One approach to harvesting data from texts is automatic information extraction, on which computational linguists have been working since the 1950s. The goal is to reduce free prose to answers to the questions “Who did what to whom and when?” and to represent these answers in a structured way. A typical information extraction pipeline starts with generic Natural Language Processing steps and then uses Named Entity Recognition to mark up the words representing persons, locations, or organizations, as well as temporal and quantitative data. The pipeline then relates these entities to one another. Relationships can take the form of predicates in sentences, coreference by pronouns, etc. The possible relationships can be inferred from external knowledge about the domain, like dates of birth and death for people mentioned in a text, or from semantic roles, such as can be inferred from the predicate of a sentence. The task is highly domain-specific, as it depends on what type of information is considered relevant. A typical task for historical research could be event extraction, which is already applied in automatic news analysis (see Grishman 2015 for a general introduction).
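The reduction of prose to a “who did what to whom and when” structure can be illustrated with a deliberately simplified sketch. This is a toy illustration only, not a real NLP system: hand-written regular expressions stand in for the Named Entity Recognition and relation-extraction steps described above, and all patterns and sample sentences are invented.

```python
import re

# Toy stand-ins for Named Entity Recognition: patterns for a person
# (a title plus a capitalized surname) and a simple date expression.
PERSON = r"(?:Mr|Mrs|Dr)\. [A-Z][a-z]+"
DATE = r"\d{1,2} [A-Z][a-z]+ \d{4}"

def extract_event(sentence):
    """Reduce a sentence to a 'who did what to whom and when' record.

    Returns a dict with agent, action, patient, and date, or None if
    the sentence does not match the (very narrow) pattern.
    """
    pattern = re.compile(
        rf"(?P<agent>{PERSON}) (?P<action>\w+) (?P<patient>{PERSON})"
        rf" on (?P<date>{DATE})"
    )
    m = pattern.search(sentence)
    return m.groupdict() if m else None

print(extract_event("Mr. Smith met Dr. Jones on 3 May 1687."))
# {'agent': 'Mr. Smith', 'action': 'met', 'patient': 'Dr. Jones', 'date': '3 May 1687'}
```

A real pipeline would replace the regular expressions with statistical or neural NER models and syntactic parsing, but the output shape, entities plus a relation between them, remains the same.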

Recent projects dealing with US foreign affairs records have taken this approach to transcripts of archival documents. They take the historical records as source data without any intermediate scholarly processing. Using OCR to create a digital representation of the text, scholars then apply distant reading methods like topic modelling or information extraction to this corpus (e.g. Kaufmann 2014–2018). Gao et al. (2017) have even used the electronic texts of the cables from the 1970s for their computer-based analysis.

The aim of implementing this approach in scholarly editions would be to create a reliable text with classical textual criticism and to extract from this text the information relevant to historians. Existing information extraction methods are built for modern texts; thus either they have to be modified to be applicable to historical texts, or the historical texts have to be modified to come closer to modern ones. Piotrowski (2012) has described the many challenges of this task. Some progress has been made in the handling of variants in historical language, for instance by Bryan Jurish (2008, 2010, 2011, 2013) or Kestemont et al. (2017). However, most of the problems remain to be solved. Scholarly editors still have to rely on their own competence and on human labour to introduce substantial knowledge about what people in the past wrote in their texts.

3.3 TEI and semantic markup

The problems computers still have with historical languages led to the decision to create manually annotated texts. Digital editions use the extensible markup language XML to add semantic markup to texts. This is made possible in particular by the strong connection between the communities maintaining the guidelines of the Text Encoding Initiative (TEI) and the community of digital scholarly editors. The TEI provides semantic annotation for many phenomena interesting to historians: names of persons, locations, or organizations can be encoded as <name>, temporal expressions as <date> and <time>. With TEI P5 there are even guidelines on how to encode structured descriptions of persons, places, and events, structures that are similar to database structures. Still, the markup provided by the TEI is deficient in its means of expressing the historical information of interest to the present study. An example is the <event> element.Footnote 1 The TEI guidelines consider an event a concept independent of the text, to which the text can refer. An expression like ‘my inauguration’ in ‘after my inauguration, I decided to leave the town’ is not an ‘event’ in this sense, but should be encoded like any other referring string, with the <rs> element. Nevertheless, while persons and places have dedicated <persName> and <placeName> tagging, historians interested in marking up named events like ‘World War I’, the ‘battle of Marathon’, the ‘coronation of Charlemagne’, the ‘Treaty of Maastricht’, or the ‘Lisbon Earthquake’ in their sources have to employ workarounds. This observation illuminates the distance between a major practice in digital scholarly editing and the research interests of historians. One reason for this might be that scholars agree much more easily on the identification of individual names of concrete persons, places, and organizations than on more abstract events.
The sample events above have formal names (some more than one), but texts often describe events in a much looser way: ‘my inauguration as bishop’ is clearly an individual event, but one unlikely to have a formalized name. Many events bear no name at all; rather, they are told as a story: ‘When Hitler’s troops crossed the Polish border on September 1 in the year 1939, World War II started.’ This sentence clearly refers to the event ‘Nazi invasion of Poland’, but the event could just as easily be referred to as the ‘Start of World War II’, or in many other ways. This demonstrates that even these short identifiers are not just arbitrary ‘names’: they create different contexts and are therefore part of a specific discourse.
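A workaround of the kind mentioned above might look as follows. This is a minimal hedged sketch, assuming TEI P5 elements; the identifying URI is invented for illustration:

```xml
<!-- Hypothetical TEI encoding: lacking a dedicated event-name element,
     the named event is marked up as a generic referring string pointing
     to an external identifier (the URI is illustrative). -->
<p>When Hitler’s troops crossed the Polish border on September 1 in the
   year 1939,
   <rs type="event" ref="http://example.org/event/ww2">World War II</rs>
   started.</p>
```

The @type value and the target of @ref are project decisions, which is precisely the kind of ad-hoc convention the lack of a dedicated element forces on each edition.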

3.4 Web of data: semantic markup by reference

Linking different names for the same event is a typical competency of Semantic Web technologies as proposed by the W3C since 2001 (Berners-Lee et al. 2001; rebranded as ‘Web of Data’ activities by W3C 2013). The Semantic Web uses abstract unique identifiers (URIs) as representations of the concepts covered by the names. With URIs, scholars can create digital representations of events without relying on ambiguous natural language terms. An increasing number of digital scholarly editions use Semantic Web technologies to solve naming issues. The most prominent method is the extension of classical indices: while such indices previously standardized names to represent the historical entity behind the name of a person or a place, URIs allow persons, places, and organisations to be identified for technical processing even when there is no name. Gautier Poupeau described this approach in 2006, and the digital edition of the Fine Rolls of King Henry III, created 2005–2011 (Ciula et al. 2008), made extensive use of these technologies in its back-end. A good example of the use of Semantic Web technologies in scholarly editions is the Teutsche Academie der Bau-, Bild- und Mahlerey-Künste by Joachim von Sandrart (Kirchner et al. 2008–2012). The text refers to many artists and artistic objects, which are identified and described in the index and can be downloaded from the site as an RDF dataset.Footnote 2
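In Turtle notation, this identification by URI might be sketched as follows; all URIs are invented for illustration, not drawn from any of the projects cited:

```turtle
# One event concept, several natural-language names: the URI, not any
# single name, identifies the event for technical processing.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/event/1939-09-01>
    rdfs:label "Nazi invasion of Poland"@en ,
               "Start of World War II"@en .
```

Any statement made about `<http://example.org/event/1939-09-01>` then holds regardless of which label a given source text happens to use.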

A more extensive formalisation than the index approach is demonstrated by the Old Bailey project (Hitchcock et al. 2003–2018). The basic transcription of the text was annotated in XML in order to facilitate structured searching and statistical analysis. This approach works because the records already tend to have a regular structure. The meaning of particular words or phrases, like names and crimes, is tagged and further sorted into subcategories, like types of verdict.Footnote 3 The final encoding of the texts contains formal descriptions of the relationships established by the markup. These are processed in a separate database, but they are also kept together with the text in the XML. Old Bailey Online is thus not just a database of criminal trials, but an assertive scholarly edition, representing the statements made in the transcription in a formal way and linking these statements to the transcription and to the image.
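The tagging of words and phrases into interpretive categories can be sketched as follows. This is a hypothetical illustration in the spirit of the Old Bailey encoding, not the project’s actual schema; element names, attributes, and categories are invented:

```xml
<!-- Hypothetical sketch: phrases in the transcription are tagged and
     sorted into subcategories (offence type, verdict type), so that the
     text doubles as a queryable record of the trial. -->
<p>
  <persName type="defendantName">John Ward</persName>, indicted for
  <rs type="offence" subtype="theft">stealing a silver watch</rs>,
  was found
  <rs type="verdict" subtype="guilty">guilty</rs>.
</p>
```

Because the categories are attached to spans of the transcription itself, statistical queries (all theft cases, all guilty verdicts) remain linked to the exact wording that supports them.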

Following the Semantic Web / Web of Data activities of the W3C, the digital representation of data is increasingly realized through RDF triples. In the context of the assertive edition, these have the advantage of modelling facts as statements about reality in a simple but expressive way, as they can be read as ‘subject predicate object’ propositions.

Parallel to the development of embedded annotation in XML, the digital humanities have developed methods for stand-off annotation. Since 2001, stand-off annotation has increasingly been realized with RDF. A standard for this annotation has been found in the Open Annotation vocabulary (Sanderson et al. 2013). Digital editions have made use of this possibility. Pundit (Grassi et al. 2013; Morbidoni and Piccioli 2015; Andreini et al. 2016; Net7) is the most advanced application of the Semantic Web to digital scholarly editions, used for example in the scholarly edition of the correspondence of Jacob Burckhardt (Ghelardi et al. 2015). It allows the annotation of any part of the text: textual fragments can be used as the subjects or objects of an RDF triple. In the Burckhardt edition, Pundit restricts the possible predicates to references to artworks and artists, general comments, quotations and references, dates, and geographical identification. Pundit saves these annotations as RDF references to the HTML elements. Work is underway on linking the annotation directly to the XML/TEI source.Footnote 4 In the end, the semantic networks, which are a unique interaction feature of this digital edition, can describe the content of the text through direct links to the parts of the source text that contain the information.

3.5 How to combine transcription with databases?

Looking forward, a number of questions arise: can we build scholarly editions which include results similar to those created by information extraction software but controlled by hand, thus bringing the full power of human understanding to the annotation? Can we encode the propositions made by the words of a text into the transcription? Can we embed the statements extracted by the reader into the sequence of characters and thus create a single digital resource representing transcription and information conveyed by it to the editor? If so, how?

One possible approach is suggested by RDFa, the W3C’s serialization of RDF embedded in HTML markup. It provides attributes on HTML elements describing RDF triples. Existing HTML attributes like @href or @src can be used as objects in the ‘subject predicate object’ triple structure. Additional attributes like @typeof, @resource, and @property provide much of the full expressiveness of RDF.


Listing 1a: example of a sentence in RDFa encoding
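The published listing is a figure image; the following is a hypothetical reconstruction of what such RDFa-encoded sentence markup could look like. The vocabulary and all URIs are chosen for illustration and are not the author’s original:

```html
<!-- Illustrative RDFa: the sentence of a fictive letter carries triples
     about a meeting event; vocabulary and URIs are invented. -->
<p vocab="http://schema.org/" resource="#meeting1" typeof="Event">
  Yesterday I met
  <span property="attendee"
        resource="http://example.org/person/smith">Mr. Smith</span>
  in
  <span property="location"
        resource="http://example.org/place/vienna">Vienna</span>.
</p>
```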


Listing 1b: triples extracted from the sentence (in Turtle/N3 notation)
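Again, the published listing is a figure image; in Turtle/N3 notation, the triples extracted from an RDFa-annotated sentence about a meeting could read as follows (all URIs illustrative):

```turtle
@prefix schema: <http://schema.org/> .

<#meeting1> a schema:Event ;
    schema:attendee <http://example.org/person/smith> ;
    schema:location <http://example.org/place/vienna> .
```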

Listing 1 demonstrates which triples can be extracted from a sentence in a fictive letter by using RDFa markup as semantic annotation. This method is attractive, as it closely relates the assertive expression to the text.

How might TEI be similarly extended? The standardized markup of the TEI covers some typical basic facts that might be extracted from texts. However, assertive annotation can be much richer and highly diverse; something more flexible is needed. I would therefore suggest transferring the RDFa approach to the TEI, creating a ‘TEIa’ annotation style. The TEI community has already discussed the idea of directly importing the RDFa attributes into the TEI (TEI-Community 2010, 2014), but it was argued convincingly that a foreign namespace is not controllable by the TEI and therefore not to be recommended. Fortunately, the TEI provides attributes which cover much of a TEIa approach: @ref creates a link from a verbal expression to an entity, and @ana links textual fragments to any kind of analytical annotation. As the TEI guidelines restrict the use of @ref to referring strings, the globally available @ana seems to be the best candidate for generic linking of textual fragments to RDF triple structures describing the relevant facts.
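Such @ana-based linking might look as follows. This is a hedged sketch of the suggested ‘TEIa’ style, not an established TEI convention; all identifiers and the target RDF are invented:

```xml
<!-- Hypothetical 'TEIa' sketch: @ana ties the textual fragment to an
     externally stored RDF statement describing the asserted fact. -->
<p>On that day the mayor
  <seg ana="http://example.org/statement/42">paid 20 pounds
    to the mason</seg>
  for repairs to the bridge.</p>

<!-- http://example.org/statement/42 could resolve to a triple such as:
     <http://example.org/person/mayor>
         <http://example.org/prop/paid>
         <http://example.org/payment/42> . -->
```

The transcription stays untouched; the factual interpretation lives in the RDF, reachable from the exact span of text that supports it.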

The Système Modulaire de la Gestion d’information Historique (SyMoGIH), which was developed by Francesco Beretta and his team (Beretta and Vernus 2012; Beretta et al. 2016), makes use of RDF-based semantic markup in combination with TEI transcriptions.Footnote 5 In the edition of the Journal of Léonard Michon (Letricot 2017), for instance, the transcribed texts are accompanied by a marginal index with short notes on events, facts, and persons. These are formalisations of the text itself: e.g. ‘Le Roy luy a envoyé à Marseille Monsieur de Saint Olon, gentilhomme ordinaire, qui le suivra jusqu’à Paris’Footnote 6 (‘The King sent to him at Marseille Monsieur de Saint Olon, gentleman-in-ordinary, who will accompany him to Paris’) is represented by the descriptive text ‘François Pidou de Saint Olon accompagne l'ambassadeur perse de Marseille à Versailles’ (‘François Pidou de Saint Olon accompanies the Persian ambassador from Marseille to Versailles’) and the people involved in the event. This information is represented as an RDF statement (http://symogih.org/resource/Info116905) about the two persons involved. The annotation is encoded with the TEI, and the global attribute @ana links the text to a database of the formalized descriptions of the content (Beretta 2013).

Other examples of this approach are found in projects realized at the Zentrum für Informationsmodellierung at the University of Graz in cooperation with the Historical Department at the University of Basel. Susanna Burghartz’s team created transcriptions of two sets of administrative records from the city of Basel from the Early Modern period: the annual accounts of the city from 1535 to 1611 (Burghartz 2015) and a criminal court record, the ‘Urfehdebuch’ (register of oaths of truce) from 1563-1569 (Burghartz et al. 2017). While Digital Humanities projects related to the Early Modern period very often focus on handling the specific properties of Early Modern texts (Nelson and Terras 2012; Estill et al. 2016), the Basel editions can be considered assertive editions. Both projects are realized in a very flexible technical environment, the GAMS (Steiner and Stigler 2014–2017), a framework for the archiving and publication of humanities data sources, in particular digital scholarly editions.Footnote 7 In the Jahrrechnung der Stadt Basel the core information unit addressed is clear: the monetary amount of a single transaction, as transmitted by the historical accountant, i.e. his rubrics (Vogeler 2015a; Vogeler 2015b; cf. Vogeler 2016 for a deeper discussion of editorial methods appropriate for historical accounts). However, even this simple criterion needs interpretation: repaid loans and interest are mixed in one common category of income. For a financial analysis this is unacceptable; accordingly, stand-off annotation is used to apply sub-categories to individual entries. In the case of the Urfehdebuch the main category is, as with the Old Bailey records, a single case. At least one core property of the data structure is already represented by the textual structure in the archival manuscript: the heading gives the name of the offender. However, the type of offence, the victim, and the punishment have to be extracted from the text and are encoded with links to a taxonomy developed for the project (Pollin and Vogeler 2017).

Embedding the interpretation of facts into the text seems straightforward, but it has several drawbacks. The examples above have shown that we need at least links to external knowledge organization systems, and translation into full RDF triples, to express enough of the content of the texts. Information science teaches us to go even further and to include time in the relationship between data and information, i.e. between the edited text and the facts the historian considers to be represented by the text. Börje Langefors (1966/1973) formulated his ‘infological equation’, according to which information is a function of the data, the recovering structure, and the time at which the interpretation takes place. This conceptualization of information argues in favour of stand-off annotation, as the semantic value of an edited text is an interpretation by the editor. In fact, there is a long-standing discussion in the text encoding community on the risks of embedded semantic markup, summarized by Thaller and Buzzetti (2012). Standardized technical solutions for this approach do not yet exist: RDF has established itself as a common data structure for the exchange of factual interpretations of text, but the question of how to maintain the linkage between the edited text and the RDF is still under discussion.
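Langefors’ infological equation is conventionally written as:

```latex
% I: the information conveyed; i: the interpretation process;
% D: the data; S: the receiver's pre-knowledge (the 'recovering structure');
% t: the time available for the interpretation
I = i(D, S, t)
```

In the editorial context, D is the edited text, S the editor’s historical knowledge, and t the moment of interpretation, which is why the same text can yield different ‘information’ for different editors or at different times.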

4 Conclusion

All of this leads to the following description of assertive editions: they are scholarly representations of historical documents in which the information on the facts asserted by the transcription is the focus of editorial work. They help the user/reader understand the text and use the information conveyed by it as structured data. This data includes interpretations of the text based on its context and the expertise of the editor. Interpretation is, in fact, part of the core critical activity of the editor: on the basis of her knowledge of the written text, its layout, and the historical circumstances under which it was produced, she decides how to describe the content beyond pure transcription. This can include normalization, categorization, reference to external resources, formal knowledge representation, and many other forms of transformation.

The assertive edition is not yet a well-defined type of scholarly editing. However, assertive editions exist, and the methods according to which they are created, modelled, and made available online are becoming part of scholarship. Digital assertive editions can be identified by their user interfaces and by their data structures, which try to combine the transcription with a database of the statements made in the text. So far, only a few historians have implemented the concept, which allows them to employ source-oriented critical methods while working with large amounts of data. The majority of historians still focus on the structured data extracted from the sources; databases are their major tool, often with rich interfaces and elaborate visualizations. The majority of scholarly editors, on the other hand, employ traditional methods of textual scholarship: they ponder complex transcription problems, evaluate variants, and include textual materiality. The combination, deep links between structured data and text as in assertive editing, is still rare. One reason for this is the technological difficulty of realizing such links. Tools like Pundit, frameworks like SyMoGIH or GAMS, and best-practice examples like the projects cited above are steps towards addressing this.