The Semantics of History. Interdisciplinary Categories and Methods for Digital Historical Research

This paper introduces and discusses the data modelling and labelling methods for interdisciplinary and digital research in History developed and used by the authors. Our approach proposes a conceptual framework for interdisciplinary research in history as a much-needed strategy to ensure that historians use all vestiges from the past, regardless of their origin or support, in the construction of historical discourse. We label Units of Topography and Actors in a wide range of historical sources and exploit the resulting data, using the Monastery of Sant Genís de Rocafort (Martorell, Spain) as a lab example of our method. This should lead researchers towards an integrated historical discourse that maximizes the potential of interdisciplinary and fair research and minimizes the risk of bias. DOI: 10.9781/ijimai.2021.02.002


I. Introduction
Historical Science is a discipline devoted to the analysis and comprehension of the past for a better understanding of the present, and hence a potential forecasting of the future. Therefore, space and time are unavoidably the main scenarios of research in History, and the alternation between permanence and change is its main object of study. This analysis of the so-called historical time [1], in which different entities exist and events occur transforming the reality of the past into something new and different, leads historians to construct the past in a narrative form.
The fact that data related to time and change are present in any written, material or immaterial vestige of the past encourages, and even forces, interdisciplinary research in History. Despite this, traditional approaches have frequently favoured written vestiges while disregarding other sources of information such as archaeology, iconography, literature, and a wide range of social sciences and humanities, which have often been considered complementary at best. As a result, the different vestiges of the past are not integrated within a single discourse but give rise to parallel discourses that may contradict one another and potentially lead to endless academic debates.
The spatial and material turns in History [2]-[6] have challenged these old-fashioned approaches and have led to a more accurate construction of the past. A new interdisciplinary research framework, in which the differences between SSH and STEM blur into a new paradigm of digital and FAIR science, challenges History and requires an effort from different disciplines to explore common languages and codes for the construction of the past. In this process of science going digital, common and exchangeable units of information are required, despite the specificity of different areas of expertise. Within this domain, some experiences with NLP have been developed [7]-[8], most of them related to textual sources [9]-[10]. Nevertheless, the strengthening of concept-based and relation-based corpora for the development of NLP in digital humanities [11], focusing on the ontological approach to historical data, suggests a challenging scenario for historians. Scholars dealing with historical science need to rethink their methodologies and their use of ICT in order to address wider and more complex research questions and to take our investigation beyond 20th-century historical issues. One of our objectives is to provide ontological reflections about data and data management in order to produce richer historical narratives, insofar as they integrate as many vestiges of the past as possible.
Our contribution offers a methodological proposal and a practical application that our team has developed in recent years, as an extended version of [12]. Arising from landscape archaeology and the study of the material vestiges of the medieval period, our research methodology deals with data labelling and records management, and it has now moved beyond the archaeological domain to integrate all vestiges of the past, regardless of their nature or origin, in order to strengthen historical research in the digital domain. We develop these issues within an updated state of the art in section II. In the following sections, we describe the method according to the labelling categories we propose and the resulting data modelling. We use a practical case study, the medieval monastery of Sant Genís de Rocafort (Martorell, Spain), as an illustrative example.

II. Landscape Archaeology and Integrated History
History has not so far been the most enthusiastic discipline to join the so-called Digital Humanities, and data management strategies have been widely challenged in our domain [13]. Far from being overwhelmed by the unknowns of this digital turn, a few exceptions deal with different ways of representing historical information [14]-[15] and with building a semantic definition for historical ontology [16]-[18]. Recent experiences focus on quantitative data analyses [19] and, predominantly, on written historical texts [20]-[21]; some of them struggle to find the best ways to deal with bias [22] and uncertainty [23]. Despite this, a normalized, user-friendly code to exploit vestiges of different nature and support is still missing, and historical knowledge seems to be restricted to its written form.
Our team developed a first proposal to identify minimum units of information and label them as Units of Topography, Units of Stratigraphy and Actors, as defined below, in the framework of Landscape Archaeology. The initial goal was to integrate vestiges from both written and material sources in the archaeological analysis of medieval landscape and, therefore, to explore landscape as a historical construct from a holistic perspective. More than twenty years later, our research information system has gone far beyond landscape archaeology or the medieval period to become a solid proposal for historical research, understanding Historical Science as a FAIR [24] construction of the past. It follows the principles stated by far-reaching research projects in this field such as the International Research on Permanent Authentic Records in Electronic Systems (InterPARES) [25] (pp. 6-7) or Parthenos [26]. This FAIR-ness is achieved by ensuring the reliability and traceability of the research process, and by integrating as many sources of information as possible, even those that had traditionally been disregarded as marginal or non-significant. This is possible due to the development of ICT in the past decades. Recent advances significantly speed up the gathering and exploitation of much larger datasets, which opens a brand-new field for historical research in which new and more complex questions can be asked of past vestiges. Ensuring the existence of clear and unambiguous definitions of the ontology-mediated elements that identify units of information and their relations [27] is an underpinning issue in this regard. The following section deals with it.

III. Labelling Categories and Data Modelling
As defined by K. Thibodeau [28] (p. 7), an Entity is something that existed and an Event is something that happened or was done. Entities and Events have a relationship of involvement, as every event involves at least one entity that might be the participant in the event, its observer, the mechanism for the event to happen, or the object altered by the event itself. In terms of data labelling, the categories Unit of Topography and Actor, as defined by A. Mauri [29] (p. 45), and their relations, provide the unique and univocal identifiers for historical facts regardless of their link to permanence (Entity) or change (Event), or the nature and support of the vestige. Units of Topography as we use them are, in fact, a wider conception of archaeological Units of Stratigraphy [30], which goes beyond their materiality and turns them into a broader concept to identify any entity or action existing or occurring at a particular time, notwithstanding its presence or absence in the archaeological record.
The following definitions apply to each one of these categories:
• Unit of Topography (UT): the evidence of an action or situation that can be located in space and time, regardless of the specificity of the information source and its biotic, non-biotic or anthropic attributes. Each UT has a specific location and date. Location can be expressed as a UTM coordinate or as an administrative delimitation that might have changed through time.
• Unit of Stratigraphy (US): the material evidence of an action that occurred in the past, representing an archaeological aspect of the cycle of time. US are of universal character and can be found on any archaeological site in the world [30] (p. 42). As a reflection of materiality, graphic and cartographic representations are essential attributes of these units.
• Actor (Ac): the individual or corporate, active or passive, protagonist of an action identified as a UT. For an individual, the attributes include name, gender, religion, citizenship, dates of birth and death, etc. Different individual actors gathered for a given period of time, with a particular purpose and under specific conditions, can act as corporate actors.
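The three categories defined above can be pictured as simple record types. The following is a minimal sketch, not the authors' implementation: the class names, field names and identifier format (`UT01`, `Ac01`) are assumptions made for illustration, and modelling US as a specialisation of UT is one possible reading of the statement that the US category is contained in the definition of a UT.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UnitOfTopography:
    """Evidence of an action or situation located in space and time."""
    ut_id: str           # hypothetical identifier format, e.g. "UT01"
    description: str
    date: str            # free-text date or date range
    location: str        # UTM coordinate or administrative delimitation
    source: str = ""     # vestige in which the unit was identified


@dataclass
class UnitOfStratigraphy(UnitOfTopography):
    """Material evidence; graphic and cartographic attributes are essential."""
    graphic: List[str] = field(default_factory=list)
    cartographic: List[str] = field(default_factory=list)


@dataclass
class Actor:
    """Individual or corporate protagonist of an action identified as a UT."""
    ac_id: str
    name: str
    corporate: bool = False
    members: List[str] = field(default_factory=list)           # ids of individual actors
    attributes: Dict[str, str] = field(default_factory=dict)   # gender, religion, etc.


# Example drawn from the case study: the foundation of the monastery in 1042.
foundation = UnitOfTopography(
    ut_id="UT01",
    description="Foundation of the monastery of Sant Genís de Rocafort",
    date="1042",
    location="Martorell (Barcelona)",
    source="donation document of 1042",
)
founder = Actor(ac_id="Ac01", name="Lord of Castellvell")
```

Keeping location and date as mandatory fields mirrors the definition of a UT, while the optional `source` field keeps track of the vestige each unit comes from.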
As the US category is contained in the definition of a UT and we might consider them as equals at some point, Table I summarizes their differences and ontological specificities.
As shown in Fig. 1, several types of relationships can be set between UT/US and Ac. A UT can include, link or delimit another UT. Hence, Inclusion, Delimitation and Link are classes of the UT-UT relation. An Actor always plays an active or passive role within a UT, so Role is the only class of the Ac-UT relation. Actors can relate to other actors through familial, political, social or economic Ac-Ac relationships. The materiality of US implies that the only possible relation between US is physical contact. When interpreting the archaeological record, we can group several US into activities and assemble these activities into groups of activities according to [31]. The relation and equivalence between UT and US, as shown in Table I, is then a matter of scale: a UT can be equivalent to a US in the archaeological record, or we can define UT during the data interpretation process. In any case, the UT scale is the one that allows for an interdisciplinary approach in History.
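The relation classes listed in this paragraph lend themselves to a small validation table. The sketch below is illustrative only: the lowercase class names, the identifier prefixes and the function names are assumptions, not part of the published model.

```python
# Relation classes allowed for each pair of categories, as described in the text.
ALLOWED_RELATIONS = {
    ("UT", "UT"): {"inclusion", "delimitation", "link"},
    ("Ac", "UT"): {"role"},
    ("Ac", "Ac"): {"familial", "political", "social", "economic"},
    ("US", "US"): {"physical_contact"},
}


def kind_of(identifier):
    """Return the category prefix of an identifier such as 'UT01' (assumed format)."""
    for prefix in ("UT", "US", "Ac"):
        if identifier.startswith(prefix):
            return prefix
    raise ValueError(f"unknown category in identifier: {identifier!r}")


def is_valid_relation(source_id, target_id, relation_class):
    """Check whether a relation class is allowed between two labelled units."""
    pair = (kind_of(source_id), kind_of(target_id))
    return relation_class in ALLOWED_RELATIONS.get(pair, set())
```

For instance, `is_valid_relation("Ac01", "UT01", "role")` holds, whereas a `"role"` relation between two UT is rejected.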

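Table I. Differences and ontological specificities of UT and US [table body not recovered; surviving row: Involvement in event: Essential (altered) | Non-essential]. * The material register does not inform about Ac.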

A. Site and Materials
We have selected a Catalan monument, the monastery of Sant Genís de Rocafort, located in the town of Martorell (Barcelona), in order to put our labelling proposal into practice and demonstrate the validity of our information system from an interdisciplinary perspective. The Lord of Castellvell and his wife founded the monastery of Sant Genís in 1042 as the ruling centre of a small Priory [32].
From then on, the monastery, located on the top of a hill, became an emblematic element of Martorell's landscape. At the foot of the hill, the ancient temple of Santa Margarida became the parish church of the Priory, which flourished in the 11th-13th centuries AD, even though the monastic community never included more than four or five monks. The monastery became dependent on larger monasteries within a Benedictine network, probably in the 13th century. From that time, the monastery of Sant Miquel de Cruïlles (Girona, Spain) [33] held the patronage over Sant Genís, while being itself dependent on the Piedmontese Saint Michael's Abbey (San Michele della Chiusa, Italy). In the late Middle Ages, Sant Genís began a slow decline that lasted until 1534, when it became a secular Priory owned by a Barcelonese merchant named Joan Bolet. He slightly refurbished the buildings of Sant Genís and Santa Margarida, and the Priory endured until the 19th century AD. After some years in private hands, the Priory became public property of Martorell's Town Council in 1967.
Historical research at this site has benefited from the proposed information management system. Architectural analysis of the building, restoration works, and archaeological fieldwork provide a new research perspective for the construction of an integrated historical discourse that builds on the written evidence kept in archival records and thoroughly analysed in the past [34].

B. Practical Examples of Data Labelling
Accordingly, we selected some written, graphic and material vestiges of Sant Genís' past and analysed them by identifying and labelling, at different stages, the US, UT and Ac reported within them. Managing the information from these lab examples in an integrated form provides a synthesis of the method's development. The following subsections deal with each type of selected source and label it appropriately; Table II and Table III summarize the information gathered so that it can be discussed further.

Archival Sources or Written Primary Files
A copy of the donation document of 1042, kept at the Diocesan Archive of Barcelona [32] and published in [34] (pp. 139-143), informs about the location and borders of the Priory. We can label texts and identify the units of information they contain, together with their attributes, by following this code: <UT00> <Ac00> <Att-UT00> <Att-Ac00> <Date-UT00>. We also label relations according to the concepts they relate: <Ac00-UT00> <Ac00-Ac00> <UT00-UT00>. A semicolon separates different UT, Ac, Attributes, or Relations identified through the same word or syntagmatic expression.
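The labelling code described above can be read back from an annotated text with a short parser. This is a minimal sketch under stated assumptions: it supposes that each tag follows the expression it labels, the function name `parse_labels` is hypothetical, and the sample sentence is invented for illustration rather than quoted from the donation document.

```python
import re

# Anything between angle brackets is treated as a candidate tag.
TAG = re.compile(r"<([^<>]+)>")
# One label: optional Att-/Date- prefix, a UT or Ac id, optional second id (relation).
LABEL = re.compile(r"^(?:(Att|Date)-)?(UT|Ac)(\d+)(?:-(UT|Ac)(\d+))?$")
# Relation patterns listed in the text: Ac-UT, Ac-Ac and UT-UT.
RELATIONS = {("Ac", "UT"), ("Ac", "Ac"), ("UT", "UT")}


def parse_labels(text):
    """Return (type, first_id, second_id) tuples for every label found in text."""
    found = []
    for tag in TAG.findall(text):
        # A semicolon separates several labels attached to the same expression.
        for part in tag.split(";"):
            match = LABEL.match(part.strip())
            if match is None:
                continue  # not part of the labelling code, ignore
            prefix, kind1, num1, kind2, num2 = match.groups()
            first = kind1 + num1
            if kind2 is not None:
                if prefix is None and (kind1, kind2) in RELATIONS:
                    found.append(("relation", first, kind2 + num2))
            elif prefix == "Att":
                found.append(("attribute", first, None))
            elif prefix == "Date":
                if kind1 == "UT":  # the code only defines dates for UT
                    found.append(("date", first, None))
            else:
                found.append(("unit", first, None))
    return found


# Invented sample sentence, for illustration only.
SAMPLE = ("the monastery <UT01> founded in 1042 <Date-UT01> "
          "by the Lord of Castellvell <Ac01; Ac01-UT01>")
```

Calling `parse_labels(SAMPLE)` yields one unit and one date for `UT01`, one unit for `Ac01`, and the relation between `Ac01` and `UT01`; tags that do not follow the code are silently skipped.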

Photographic Vestiges and Architectural Analysis
Architectural analysis from an archaeological perspective aims at identifying the building phases and further transformations in architectural heritage. Photographic vestiges usually allow for the identification of constructive elements that have since disappeared. We identified and labelled US in Fig. 2 accordingly:

Archaeological Fieldwork
Because archaeology is a destructive process, building a precise and detailed archaeological record following a clearly stated protocol has been the commitment of archaeological science for a very long time. US have been the main unit of information since 1980 [30].
In Fig. 3, we selected just three US for labelling as a lab example from the entire archaeological record at this site. The stratigraphic method for archaeological excavation and recording is widely accepted among scholars in this domain, and our UT/Ac labelling strategy is built in accordance with the archaeological method, as stated above, due to the authors' archaeological background. Therefore, we could not avoid selecting archaeological examples for building an integrated historical narrative, as the archaeological record is concomitant with our proposal for information management, even though the excavation results in Sant Genís are much wider than shown in this paper.

Bibliographic Reflections
In all scientific production and in any form of Past Construction in particular, the so-called state of the art (or past reflections, in Thibodeau's terms [28]) is a valuable source of information that must be considered in terms of data labelling and management. The same labelling method proposed for archival sources or written primary files works for secondary information as well.
In Fig. 4 we show an example of data labelling within an excerpt from a published piece of research about the Priory of Sant Genís [34]. That work is a summary and study of a rich documentary assemblage informing about the priory from archival sources, and it provides the historical framework and state of the art before the archaeological excavation started.
[...] [35]-[36]. Interpreted relations between US and UT are recorded in Table III.
Fig. 4. Labelling example of a short excerpt from the bibliographic reference of a previously published study [34].
UT/Ac data labelling is operational on original texts when possible (even if annotated by hand, as in the example in Fig. 4) and also on transcriptions and translations. In the subsection on archival sources above, we labelled the transcribed Latin version of the Priory's foundation document; below, the translation of the excerpt in Fig. 4 is labelled in English. Actors labelled in both examples are summarized in Table II.

Landscape Analysis and Cartographic Sources
The proposal introduced in this paper offers a useful tool for landscape archaeology as well, and we can label cartographic sources similarly. We must bear in mind that the concept of Unit of Topography, in accordance with the definition proposed, implies a location and date as main attributes for further exploitation. UT/US gathered from selected examples and included in Table III have a precise location expressed as UTM coordinates that have been used to produce and label the maps shown in Fig. 5.
To that extent, Geographic Information Systems (GIS) have a great potential for data processing and exploitation, and they provide a useful tool for the landscape approach to historical knowledge. Providing topographic and chronologic attributes for past entities and events is a conceptual requirement for a spatial turn [37] in History, but also for a general scientific procedure of Past Construction under normalized terms and categories.

C. Data Exploitation
Cartographic representations and data exploitation by means of GIS technology are both a way to process data and a final representation which may be used for publication and dissemination. However, our proposal, originally conceived within the domain of landscape archaeology, goes beyond spatial representations and can focus on temporal sequences or relational data interpretation as well.
Although we summarized data gathered from our lab example in a couple of simple tables, these form part of an ontology-mediated database in which UT, Ac and the relations between them are collected in separate tables. The diagram shown in Fig. 6 represents the main components of a database storing the crucial units of information, their attributes and relations. When keeping this structure, databases can adapt to the needs of a particular research project and show variable interfaces and self-search exploitations, but they will always be interchangeable and potentially interconnected, as long as they share a common data model [39]-[40]. Fig. 7 represents some interfaces of the applications currently used by our team. One of these is a GIS-based information system created for archaeological purposes [41]-[42]. All of them share the same categories (UT/US/Ac) and hence they allow interdisciplinary research beyond the particular expertise of any one of us.
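The separation of units, actors and relations into distinct tables can be sketched with Python's built-in `sqlite3` module. This is an illustrative schema only, not the structure shown in Fig. 6: every table and column name is an assumption, and the role value `founder` is a hypothetical example.

```python
import sqlite3

# Hypothetical schema: one table per category and one per relation type.
SCHEMA = """
CREATE TABLE ut (
    ut_id       TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    date        TEXT,
    location    TEXT
);
CREATE TABLE actor (
    ac_id TEXT PRIMARY KEY,
    name  TEXT NOT NULL
);
CREATE TABLE ac_ut (
    ac_id TEXT NOT NULL REFERENCES actor(ac_id),
    ut_id TEXT NOT NULL REFERENCES ut(ut_id),
    role  TEXT NOT NULL
);
CREATE TABLE ut_ut (
    ut_from        TEXT NOT NULL REFERENCES ut(ut_id),
    ut_to          TEXT NOT NULL REFERENCES ut(ut_id),
    relation_class TEXT NOT NULL
        CHECK (relation_class IN ('inclusion', 'delimitation', 'link'))
);
CREATE TABLE ac_ac (
    ac_from        TEXT NOT NULL REFERENCES actor(ac_id),
    ac_to          TEXT NOT NULL REFERENCES actor(ac_id),
    relation_class TEXT NOT NULL
        CHECK (relation_class IN ('familial', 'political', 'social', 'economic'))
);
"""


def build_database():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def actors_in_unit(conn, ut_id):
    """List (name, role) pairs for the actors involved in one UT."""
    rows = conn.execute(
        "SELECT actor.name, ac_ut.role FROM ac_ut "
        "JOIN actor ON actor.ac_id = ac_ut.ac_id "
        "WHERE ac_ut.ut_id = ? ORDER BY actor.name",
        (ut_id,),
    )
    return rows.fetchall()


conn = build_database()
conn.execute("INSERT INTO ut VALUES (?, ?, ?, ?)",
             ("UT01", "Foundation of the monastery", "1042", "Martorell"))
conn.execute("INSERT INTO actor VALUES (?, ?)", ("Ac01", "Lord of Castellvell"))
conn.execute("INSERT INTO ac_ut VALUES (?, ?, ?)", ("Ac01", "UT01", "founder"))
```

Because every project-specific database would keep the same core tables, records from different projects could in principle be merged or queried together, which is the interchangeability the paragraph describes.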
When gathering and storing data in the form of tidy-structured tables with variables in columns and observations in rows [43], and according to identified US, UT and Ac regardless of the nature or support of the vestige, multiple representations are possible. Flow diagrams and matrices can visually establish the temporal sequence of activities and their permanence or transformation. Fig. 8 shows an extract of a historical Harris-like [30] matrix created for the Priory of Sant Genís de Rocafort, including the Ac and UT exemplified throughout this paper. The archaeological US Harris matrix routinely developed in archaeology is also included. This is an example of data exploitation and representation as an interdisciplinary historical matrix, including the archaeological results within the historical discourse arising from written evidence and explaining the Priory's past in a richer construction. Notice how the material vestiges of the buildings' refurbishment in the 16th century were positively identified within the archaeological register combined with the architectural analysis. In such a representation, Actors mentioned in written vestiges can be assigned to phases and located visually within the corresponding period.
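A Harris-like matrix is, in essence, an ordering of units by "earlier than" relations, which can be sketched with the standard-library `graphlib` module (Python 3.9 or later). The identifiers and the precedence relations below are hypothetical simplifications of events mentioned in this paper, not the content of Fig. 8.

```python
from graphlib import TopologicalSorter


def sequence_layers(predecessors):
    """Group units into successive layers, earliest first.

    `predecessors` maps each unit id to the set of units that precede it.
    Units in the same layer have no ordering constraint between them.
    """
    sorter = TopologicalSorter(predecessors)
    sorter.prepare()  # raises graphlib.CycleError on contradictory sequences
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


# Hypothetical identifiers:
# UT01 foundation (1042); UT02 secular Priory (1534);
# UT03 refurbishment reported in written sources;
# US01 material evidence of the refurbishment; UT04 public property (1967).
PRIORY_SEQUENCE = {
    "UT02": {"UT01"},
    "UT03": {"UT02"},
    "US01": {"UT02"},
    "UT04": {"UT03", "US01"},
}

layers = sequence_layers(PRIORY_SEQUENCE)
```

Here the written unit `UT03` and the material unit `US01` fall in the same layer, which is how a combined matrix would place archaeological and documentary evidence of the same phase side by side.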

V. Discussion: Towards an Integrated History
The most striking point of using Unit of Topography and Actor as ontological concepts of historical semantics is that they allow for truly interdisciplinary research. Unfortunately, historical science understood as a whole still lacks a common code for data integration within its discourse. The methodological particularities of each discipline make sense insofar as they follow specific goals and socially determined functions. This should not be a problem for creating an integrated construction of the past, as long as the disciplines share a common system for information management and exchange, which, unfortunately, has not happened yet [44] (pp. 41-42). In a context of FAIR research, the aim to create an integrated historical discourse is a challenge that historians should face with a sense of urgency. Nevertheless, interdisciplinarity in history does not mean, or should not mean, juxtaposing different past constructions arising from each discipline (history, archaeology, literature, iconography, archival science, linguistics, law, and SSH in general), but creating an interdisciplinary narrative joining the efforts of many different scholars. Sharing a common system for information management and exchange allows us to monitor the research process from the beginning and to locate information precisely, which makes it findable and accessible to colleagues from diverse domains of expertise. This is the best way to deal with bias and uncertainty.

[Table columns: ID Number | Definition | Descriptive Attrib. | Graphic Attributes | Cartographic Attrib.]

Our proposal suggests widening the scope of the archaeological method as a response to this challenge. As defined in [30] (p. 42), the concept of US includes any kind of action leaving a material imprint and identified within spatiotemporal coordinates, no matter if it is positive (adding materials) or negative (removing them). Therefore, during the archaeological fieldwork we register positive US when stratigraphic accumulation of materials occurs and we identify negative US in holes, broken structures or eroded layers. This concept has proved to be wide enough to be adopted by other archaeology-related disciplines such as architectural analyses of buildings and material heritage studies [45] (p. 79). Since materiality (and, therefore, its cartographic and graphic informative dimension) is the main feature of US, could we define a similar concept equivalent to this unit of information but delinked from its material component? Yes, we could. Units of topography provide this univocal identification of entities and events in the past, with spatiotemporal coordinates and relations between them. Archaeology does not inform about actors, but many other sources of historical information do. Therefore, the proposal of UT/Ac gathering is an adequate compromise solution in order to develop an ontology for past construction in which entities and events are identified through non-ambiguous parameters.

Successful data labelling strategies such as the Text Encoding Initiative (TEI) are limited to written sources and are hence increasingly used in literature and language studies [46]. They might be successfully applied to the written examples we provided, but they fail in labelling iconographic or photographic vestiges. Textual encoding and labelling tools have great potential for written vestiges from the past, but historical science needs an additional category for data labelling regardless of the origin or support of the vestige. Our labelling proposal implies looking for UT and Ac in many different shapes and supports, even if textual sources are the most abundant. This actually requires more interpretative knowledge on the historians' part, as it is not yet possible to detect these data units through merely automatic data labelling applications.
Furthermore, UT/Ac identification allows for multiple readings of past vestiges, which can be as exhaustive as required in a particular research project. Data gathered in the examples provided throughout this paper have been enough to demonstrate the validity of the method. We have attempted to find a balance between a theoretical demonstration and a practical case study, but the historical sources informing about the Priory of Sant Genís de Rocafort are much wider and the historical matrix arising from them is far more complex. This also shows how historians can read, analyse and interpret past vestiges to a desired level, according to their interest or domain of expertise, and how future experts dealing with the same vestiges can then generate new knowledge that builds upon previous reflections rather than disregarding them.
In recent years, data modelling and database construction in the terms described in this paper have allowed us to develop integrated approaches [29], [47] and software [42] that overcome the traditional inconveniences arising from the fragmentation of sources of information. Interactive multimedia and artificial intelligence have great potential to automate research processes and have proved to be novel and useful in the domain of SSH. Research projects in the fields of History, Archaeology or Archival Science can benefit significantly from shared and transdisciplinary approaches to the past when using a common code. We consider the dialectics between US/UT and Ac as useful categories for data modelling, according to the semantics of Entity and Event as major ontological concepts in historical science [28], [48].

VI. Conclusion
Historical science is a wide discipline that has to consider all the sources of information available, which implies several other disciplines taking part in this process. Archaeology, linguistics, literature, and many others provide valuable data to contribute significantly to the construction of the past. Historians should not add some of these data to a main discourse arising only from written sources but integrate all this information within interdisciplinary processes of data gathering and exploitation.
[...] Table II and Table III are included.
The NLP community has presented several attempts to process historical knowledge according to two underpinning ideas, one theoretical and one practical or methodological: the definition of event and data extraction through text labelling. These initiatives share two common limitations: the academy has no precise and shared definition of event, and the data extraction procedure is limited to written sources. Furthermore, no attempt has been made to find a domain-specific definition of event combining the historical perspective and ongoing research in the NLP field.
Hybrid intelligence would be, in our view, a challenging field in which to explore the possibilities for historical knowledge to become digital and interdisciplinary, and to develop appropriate UT/Ac recognition patterns. NLP systems might focus on finding and tagging event-meaningful concepts in written sources, even including archaeological excavation documents and text-supported records, but the heterogeneity of supports and formats for historical vestiges is much wider than these.
The concept of Unit of Stratigraphy, broadly used in archaeology, provides a useful characterization of actions in the past according to their materiality. Widening this idea, the categories Unit of Topography and Actor, as described throughout this paper, provide single and univocal semantic concepts to identify entities and events. Building databases according to these categories is a valuable strategy that integrates knowledge from both SSH and STEM into the historical domain and makes information systems interoperable, ensuring the traceability of the entire research process. While there is a range of opportunities for automating processes in terms of text labelling by tagging Units of Topography and Actors, there is still a need for trained and experienced historians who decide the level at which data have to be recorded. In any case, there is considerable potential in terms of data exploitation and visualization, to which ICT in general should definitely contribute.