Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done

This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, and ways to reference and acknowledge contributions to the creation and enrichment of data within these Machine Learning systems. We discuss how one can publish Ground Truth data in a repository and, subsequently, inform others. Furthermore, we suggest appropriate citation methods for HTR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of Machine Learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.


I INTRODUCTION
Within the humanities, working with digitised (primary) source material is no longer a novelty. Due to both large and small projects over recent years, an increasing number of digital sources have become available. Most of these projects have been realised and enriched with Automatic Text Recognition (ATR, machine learning-based text recognition for print and handwriting) techniques. Although the resulting data sets of machine-readable texts are immensely promising for the humanities, these developments also inevitably challenge existing disciplinary practices. This paper revolves around several challenges tied to preparing and publishing ATR results. No clear practices have been established on how digital resources like ATR recognition models and training material should be properly stored and cited. We also lack clear guidelines on how to make people aware of the several layers of contributions in publishable products. These are the perspectives that require in-depth elaboration.
ATR and, more generally, the latest engines for text recognition depend on the digitisation of sources and the production of transcriptions to create and synthesise models via machine learning. For general models, massive numbers of documents, accompanied by correct and (ideally) uniform transcriptions, understood as Ground Truth in machine learning, are fundamental; the production of these corpora is, therefore, a challenge that falls in the category of big science (e.g. [Chawla, 2017]). Groups of volunteers (citizen scientists) are frequently involved in this data creation process, which raises the question of how we should properly acknowledge their contribution. Moreover, when talking about the digitisation efforts of the Galleries, Libraries, Archives, and Museums sector (GLAM), we should acknowledge the production of digital images of documents. To discuss this problem, we organised a hybrid workshop at the Transkribus User Conference 2022 (Innsbruck, Austria). In the context of Ground Truth creation, we aimed to discuss: how can we properly reuse, reference, and acknowledge contributions? What are the best practices thus far? Many participants shared our sense of urgency about these questions and proposed fruitful ideas. This paper is the result of that exchange, via a subsequent writing sprint with the community. It contributes to the increasingly important workflow processes for data generation, based on shared and highly practical experiences. This article departs from the concept of Ground Truth. The concept stems from computer science and implies that an object can be described as it is. From a philosophical and epistemological point of view, this is highly problematic. Supervised machine learning algorithms require Ground Truth to imitate the result in the form of a model. As the term Ground Truth suggests, it is a form of data that adheres to specified standards and is considered, at least by a group of people, to be an accurate representation of the material,
in our case, handwritten or printed material [Muehlberger et al., 2019, 957]. This form of representation informs us about the accuracy of algorithms, since Ground Truth is partially used to measure errors. Initial transcriptions may contain quite a few mistakes, but thoroughly checking them, most often by a human, can lead to accurate transcriptions according to defined standards. Ground Truth should thus be understood as the 'gold standard' that is ideally reached. Alternatively, as Muehlberger et al. [2019, 957] describe it: '[Ground Truth] is a term commonly used in machine learning to refer to accurate, objective information provided by empirical, direct processes, rather than that inferred from sources via the statistical calculation of uncertainty.' As such, it can function as benchmarked data. Having as much Ground Truth available as possible is essential to provide large (or even general) models for specific scripts or types of handwriting. However, once large models are available for fine-tuning (in the sense of transfer learning), a reduced amount of training data is needed.
Ground Truth can be drawn from many sources. A bespoke transcription can be produced from scratch for a specific ATR project, but it is often more efficient to adapt Ground Truth from a transcription or edition that already exists. This raises the issue of varying or conflicting transcription conventions, which may not be easy to identify but can impact the project to which the Ground Truth, or combination of Ground Truths, is to be applied. Suppose the Ground Truth is to be shared and potentially bundled into multiple models. In that case, such conventions must be included in the description or metadata, or at least made available in some form. This will help potential future users select the Ground Truth that is most appropriate for their project and help explain certain behaviours of a model. Generally and roughly speaking, there are various ways of producing transcriptions. Two frequently used approaches are diplomatic and semi-diplomatic transcriptions. The former transcribes as much as possible as is, taking a large character set into consideration; the latter allows for adaptations to improve readability, e.g. writing out abbreviations and simplifying some characters. Some transcriptions are hyper-diplomatic, in the sense that ligatures, such as 'st' ligatures, are transcribed, or that types of 's' (e.g. the 'long s') or 'r' are distinguished. Machine learning-based models are indifferent to character sets; however, their capabilities are confined to the scope of their training data.
From a legal perspective, and because of the data's value due to its laborious production, Ground Truth should at least be understood as a data (by)product of a project and considered for publication. In many legal systems, the creators or producers of data can provide Ground Truth independently of image rights. However, since training processes for Handwritten Text Recognition models require both Ground Truth and images, image rights can often present a challenge to the (re)training of these models. In any case, we should store the different stages for future reuse.
In the first part of this article, we contextualise strategies within the ethical and legal limitations of sharing Ground Truth. Because of these limitations, and the urge to make people aware of the labour that is poured into data creation, the reuse of Ground Truth requires that contributions and contributors be acknowledged, which is discussed in the second section. In our conclusion, we combine and synthesise the two parts. This article is a proposal intended to start a discussion about how to conduct and acknowledge the work that goes into generating training data for machine learning. It must be mentioned that the proposed solutions are not meant as definitive, or to provide a complete overview of all thinkable options. Additionally, one should remember that this article is the result of a collaboration among a large group of people with varying backgrounds. Consequently, we want to make the community aware that definitions may vary across fields, and we will not elaborate on all perspectives.

II SHARING GROUND TRUTH
Much labour and many resources are poured into manually and semi-manually producing Ground Truth transcriptions. Reusing transcriptions, and their associated images, promises to support small(er) projects and institutions with various materials and to speed up their work greatly. Furthermore, to advance digital techniques, all available material could provide valuable training data for future projects and (new versions of) tools, like ATR engines, or other downstream tasks, such as language models for Named Entity Recognition [Ströbel et al., 2022]. However, sharing transcriptions, e.g. in a repository, is, in our opinion, not enough and does not fully adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) principles [Wilkinson et al., 2016]: the data should also be (easily) findable by others. Still, sharing data can have legal and/or ethical limitations. It should be stressed that in this section we are explicitly talking about sharing Ground Truth data, not about sharing ATR models.

How to Export Data
The various programs that allow for the creation of ATR material have options to export the generated and/or corrected transcriptions. When possible, both the transcriptions and images should be exported, depending on any potential copyright or image rights.4 If this is not possible, it is helpful to at minimum sustainably store the "pure" transcriptions.
Within the Transkribus tool, provided by the READ-COOP SCE, the export appears as shown in Figure 1.5 Widely used standards, like ALTO XML, PAGE XML, and hOCR, allow for an alignment between image and transcription, based on coordinates, which is required to connect transcribed text on a character, word, or line basis with images and allows for the opportunity to (re)train models based on machine learning. ALTO and PAGE are also supported by the eScriptorium application (see Figure 2), which has been developed in the context of a variety of national and European projects [Kiessling et al., 2019], and they are the main formats used to store ATR output. TEI, better known in the Digital Humanities community, is primarily dedicated to producing critical digital scholarly editions but could also serve as a long-term storage format due to its wide user base. The Gallicorpa project follows this approach and proposes TEI as an exchange format [Pinche et al., 2022]. Although it is hard to predict future developments, we are optimistic that at least a conversion from PAGE and ALTO XML to future standards will be possible. As a consequence, we encourage exports in these formats. Both PAGE and ALTO XML are open data formats defining an XML structure while keeping the option more or less open to adding custom properties. Exporting valid TEI XML as a third option also seems sensible to us.
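The coordinate-based alignment that PAGE XML provides can be illustrated with a short sketch. The fragment below is a hypothetical, minimal PAGE document (the namespace is the published PAGE 2013 schema); the file name, region IDs, coordinates, and transcription text are all invented for illustration. The helper extracts each text line together with the image coordinates it is anchored to, which is exactly the link that (re)training a model relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal PAGE XML document: one text line with its image
# coordinates and its transcription. All concrete values are made up.
PAGE_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <TextEquiv><Unicode>In the beginning was the word</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def extract_lines(page_xml: str):
    """Return (coordinates, text) pairs linking each line to its image region."""
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS).get("points")
        text = line.find("pc:TextEquiv/pc:Unicode", NS).text
        lines.append((coords, text))
    return lines
```

Because the polygon points are kept next to the transcription, the same file can feed an ATR training pipeline or be converted to another coordinate-bearing format without losing the image alignment.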
While some would call for a centralised Ground Truth repository, this could be a costly affair and result in duplicated work, as funding agencies can require that output be stored in specified (e.g. national) repositories.6 Consequently, a solution to this problem of the decentralised distribution of sources is discussed below.

Publishing Data in a Repository
Generally, storing data in a FAIR-compliant, noncommercial repository with a persistent identifier, like Digital Object Identifiers (DOIs), is preferred. At the same time, it is highly encouraged that data output be made accessible in a structured format. Images and XML files should reside in sub-folders, with descriptive names for folders and images.
Repositories such as Zenodo offer the possibility of adding structured metadata that includes the names of contributors, licenses for reuse, and (if applicable) URLs to external web pages. Besides this added information, it is essential to add a README file to a published data set. This helps users navigate the data dump and allows for straightforward reusability [Sicilia et al., 2017]. Alternatively, data can be provided through publicly available Git repositories such as GitHub or GitLab, but these do not offer DOIs. To make use of user-friendly Git environments and receive a DOI, a mixed solution is a possible way forward: version management can be done through GitHub, while Zenodo stores versioned and frequently updated data sets. Conveniently, some platforms like GitHub allow a repository to be linked with Zenodo semi-automatically. GitHub is then used to handle the versioning and creation of releases, while Zenodo provides the user with a DOI, making the repository findable in the Zenodo search engine (see Figure 3). If set in place, this allows different versions of transcriptions and documents to become available online, based on different parameters or (underlying) ATR models. At the same time, it is vital to establish whether manual Ground Truth or automatic transcriptions were used, to determine the quality and the source of a data dump. Whichever version one publishes, this should be made transparent to other potential users.
At the same time, it is helpful to provide transcription guidelines or manuals to inform users about the rules guiding the process and the characteristics of the transcription of documents. In connection with particular Ground Truth, this information will allow potential users to search for data sets adhering to transcription conventions that fit the criteria they are interested in [Sahle, 2016]. For example, the textual output could include larger or smaller character sets, or only parts of a 'document' could be published.
Sharing data through a repository is not only beneficial but also aligns with good academic practice. To ensure the data is not just accessible but also easily discoverable, in line with the FAIR principles, it is essential to consider linking the repository to a sharing platform such as HTR-United.7

HTR-United: Sharing Your Data
Several programs allow for the creation of ATR data. Regardless of the tool used, it is up to the creators whether or not they want to share their work. Given the enormous diversity of existing repositories where work could be stored, there is an increasing need for an overview of available Ground Truth data sets or, if possible, open-sourced models. Furthermore, the relative novelty of the output type requires new standard practices for publishing it. Alix Chagué and Thibault Clérice [Chagué and Clérice, 2022a] developed the HTR-United initiative to bring together different Ground Truth sets (see Figure 4). HTR-United rests on three imperatives: 'a collaborative enterprise for the community; friendly to consumers and data producers; as low tech as possible (because $$)' [Chagué and Clérice, 2022b]. Furthermore, according to [Risam and Gil, 2022], 'minimal computing connotes digital humanities work undertaken in the context of some set of constraints. These could include lack of access to hardware or software, network capacity, technical education, or even a reliable power grid.'
Figure 4: Website of HTR-United. https://htr-united.github.io/index.html [30 September 2022]

This much-needed initiative offers a solution that is easy to use and access, allowing contributors to store their data sets at any given location, preferably with a DOI. It also centralises an overview of those Ground Truth data sets. The HTR-United interface allows users to filter Ground Truth by language, script/type, and periodisation. Furthermore, the catalogue contains metadata in YAML format, updated via continuous integration through GitHub Actions. Chagué and Clérice developed a form that simplifies the process of creating the YAML files and badges and uploading metadata to the catalogue [Chagué and Clérice, 2022b, slide 15]. The developers (and, at the same time, initiators), amongst whom is our co-author Chagué, know that adding a form with questions increases the complexity of adding data. However, they think it is worth the effort, as it provides a uniform overview of the digital environment.
Figure 5: Catalogue of HTR-United. https://htr-united.github.io/catalog.html [30 September 2022]

HTR-United limits itself to a predetermined way of sharing Ground Truth, and does so for practical reasons. Providing a relatively strict schema for the catalogue allows for a machine-actionable method of checking the conformity of submissions. It also supports searches across the catalogue [Chagué and Clérice, 2022b, slide 10].
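To give a sense of what such a machine-actionable conformity check amounts to, the sketch below expresses a catalogue entry as a Python dict and validates it against a list of required fields. The field names and values here are assumptions loosely modelled on the kind of information HTR-United records (title, location, language, script, period, license); they are not the authoritative schema, and the DOI is a placeholder, not a real identifier.

```python
# Illustrative catalogue entry; field names are assumptions, not the
# official HTR-United schema, and the DOI below is a placeholder.
entry = {
    "title": "Ground Truth of 17th-century Dutch charters",  # invented example
    "url": "https://doi.org/10.5281/zenodo.0000000",          # placeholder DOI
    "language": ["Dutch"],
    "script": ["Latin"],
    "time": {"notBefore": "1600", "notAfter": "1700"},
    "format": "PAGE XML",
    "license": "CC-BY-4.0",
}

def validate(entry: dict, required=("title", "url", "language", "license")) -> list:
    """Machine-actionable conformity check: report any missing required fields."""
    return [field for field in required if field not in entry]
```

Run in continuous integration (as HTR-United does via GitHub Actions), such a check rejects submissions before they reach the catalogue, which is what keeps the catalogue searchable in a uniform way.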
From the catalogue on the HTR-United website (see Figure 5), it is possible to download the metadata into Zotero as an 'Item Type: (Digital) Document'. This download option simplifies the future referencing process (see Figure 6).8 To briefly conclude the section on sharing data, we would like to emphasise four key approaches to processed textual data for future text recognition:
• Export your data (including images, if possible);
• Upload it online, using services that support versioning, like GitHub, or better still, university repositories or Zenodo;
• Get a DOI and make it a citeable publication;
• Make others aware of it (through HTR-United or other possible means).
In the above, we focused on sharing Ground Truth, that is, texts that have been corrected manually. However, when models perform well, we may reach a point where sharing large data sets of raw ATR-produced transcriptions would also prove helpful, even though they are not perfect. Such data should be explicitly designated as machine-generated transcriptions, in which case it is necessary to note the measured or assumed Character Error Rate (CER) [Hodel et al., 2021, 13; Cordell, 2017; Cordell, 2020]. For such machine-generated transcriptions, it is also advisable to indicate the model used. Although CER is often used to measure quality, the calculation varies between tools, as recent studies show [Neudecker et al., 2021], so it is necessary to mention the tool used as well.
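As a reference point for why reported CER values can differ between tools, one common formulation computes CER as the Levenshtein (edit) distance between the reference transcription and the ATR output, divided by the length of the reference. The sketch below implements that formulation; it is one convention among several (tools differ, for example, in normalisation of whitespace and in the choice of denominator), not the definitive calculation.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance divided by reference length.

    One common convention; tools vary in normalisation and denominator choice.
    """
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

For example, `cer("kitten", "sitting")` yields 0.5 (three edits over six reference characters); a perfect transcription yields 0.0. Reporting the tool alongside the number, as argued above, lets readers know which convention produced it.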

III REFERENCING DIGITISED RESOURCES AND DIGITAL OUTPUT
For certain objects in the humanities, such as physically published books, it is obvious how to cite them, and it is clear what questions need to be answered in a citation: it should state who wrote the text, who contributed, and what the source was. An exact structure must be followed, depending on the citation style. In this section, we focus on referencing digital objects, whether they are resources (digitised texts), data sets (recognised texts), or even ATR models. Compared to manuscripts, prints, and other forms of written documentation that have been referenced for centuries and even millennia, approaches to dealing with digital (ephemeral) objects are in their infancy [Föhr, 2018].
Several software solutions exist for creating, collecting, editing, and reusing bibliographic references for annotation purposes. These include, to name only a few, EndNote, Citavi, Zotero, and Mendeley. Zotero is a free, open-source referencing tool provided by the Corporation for Digital Scholarship that can adapt to various referencing styles. As it is a free and open-source tool that has been programmed by and for humanities scholars [Takats, 2010], we use Zotero as a point of departure for suggesting how to reference and acknowledge digital (re)sources and contributors.9 We have combined experiences, suggestions, and guidelines in this section. As above, we focus on the FAIR and CARE principles (see section 3.3), while striving to use persistent identifiers. The primary focus will be on determining the appropriate occasions for citing digital resources, identifying the essential elements that must be included in such citations, and recognising the specific attributes of a digital resource that warrant acknowledgement.

Referencing Data Sets
In the humanities, data sets and models have only recently been cited, if at all, which results from a lack of standards within the field.10 However, in computer science and machine learning, guidelines on how to cite data sets and software exist and are mostly adhered to [Gebru et al., 2021].11 Several kinds of data sets could and should be cited. First, transcriptions, which include information about where on an image page a specific word or line is situated. These transcriptions encompass manually created Ground Truth, machine-generated transcriptions, and anything in between, such as machine-generated but manually corrected Ground Truth. Second, there is text enrichment or, more generally, semantic annotation, e.g. georeferenced place names, named entity recognition, and linking terms to authority data. While these enrichments may be integrated within one overarching data set, what has been done and/or used, and by whom, should be clearly stated in all circumstances.
Standard literature management software is only beginning to incorporate the citation of data sets and software. Zotero, for example, as of 22 September 2022, does not support output types like 'data sets' or 'data/ATR models', though its developers state that the category 'data sets' will soon be added.12 Once it is, citing these kinds of scholarly and scientific contributions will be easier, and hopefully future releases of large data sets will acknowledge such contributions accordingly.13 Since data [Gitelman, 2013], models [Speer, 2017], and even concrete objects [Woolgar and Cooper, 1999] are never neutral, we need to think about metadata and data publications not only in terms of citation technologies but as an end in itself. Over the last few years, the potentially egregious effects of using skewed or biased training data have been more coherently acknowledged in computer science, machine learning, Natural Language Processing, and other data-intensive fields [Mehrabi et al., 2022]. Some work has been done in these areas, particularly from the perspectives of data ethics and algorithmic bias. One approach is to apply bias-mitigating algorithms or causal inference models as in-analysis mitigation strategies.
Another approach is ensuring sufficient pre-analysis documentation to allow for the responsible use of data. As [Gebru et al., 2021] state, bias may be mitigated by 'careful reflection on the process of creating, distributing, and maintaining a data set, including any underlying assumptions, potential risks or harms, and implications of use'. Thus, responsible metadata does not merely encompass the application of FAIR principles [Wilkinson et al., 2016] and sufficient provenance information; it also details why the data was gathered, for what research purposes and to what end the research was conducted, which relevant tools and technologies were used in the collection process, and if and how the data underwent possible transformation processes (selection and 'cleaning') and/or annotation ('labelling'). All this information is essential to determine whether a data set can be used or repurposed for specific research. Data sets that do not provide such information should probably be treated as suspect and with the greatest of reservations, or at least tested in depth.
Unfortunately, because of the incredible variety in format and content of digital humanities data and resources, no single agreed-upon metadata schema, let alone data schema, exists that serves all purposes, needs, and contexts of researchers. The heterogeneity of humanities data is only matched by the prolificacy of metadata standards, of which at least three hundred exist [Riley and Becker, 2010]. However, the salient point is not that a particular data standard should be primary, but that a trustworthy data source will clearly state which metadata schema its (meta)data adheres to.
Clear and comprehensive metadata allow for correct and comprehensive referencing and citation. As with digital data standards, there is no agreed-upon standard for referencing data sets. However, like research software, data sets should 'be cited on the same basis as any other research product such as a paper or a book' [Druskat, 2022]. Proper citation of data sets facilitates research transparency and ensures credit and accountability on the part of the data set producers [Ball and Duke, 2015]. Metadata fields that should be part of any data set citation, if known, include author, publication date, title, version, resource type, publisher, identifier, and location.
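A minimal sketch of how the fields listed above could be assembled into a citation string. The ordering is an assumption following a common DataCite-like pattern, not a prescribed citation style; all example values, including the DOI, are placeholders, and the identifier here doubles as the location.

```python
def cite_dataset(meta: dict) -> str:
    """Assemble a data set citation from standard metadata fields.

    Field ordering follows a common DataCite-style pattern; this is a
    sketch, not a prescribed citation style.
    """
    return ("{author} ({publication_date}). {title} (Version {version}) "
            "[{resource_type}]. {publisher}. {identifier}").format(**meta)

example = {
    "author": "Doe, J.",                     # hypothetical creator
    "publication_date": "2022",
    "title": "Example Ground Truth Set",     # invented title
    "version": "1.0",
    "resource_type": "Data set",
    "publisher": "Zenodo",
    "identifier": "https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
}
```

With the placeholder values, `cite_dataset(example)` produces "Doe, J. (2022). Example Ground Truth Set (Version 1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.0000000". The point is not the formatting itself but that every field has a slot, so a missing field is immediately visible.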

Referencing ATR Models
In parallel to data sets, the 'Item Type: Software' could be used for referencing ATR models, as is suggested on the Zotero forum.14 This 'Item Type' requests information such as title, programmer, abstract, series, version and date, programming language, URL, and rights. Questions arise as to whether such an 'Item Type' is suitable for ATR models or whether other disciplines might offer more fitting approaches. To begin, we compile an inventory of elements essential for citing an ATR model.
Among these elements, the authors propose including in the model annotations a feature that extends beyond a mere URL: the incorporation of a Digital Object Identifier (DOI) for each ATR model. This could be implemented by automatically generating DOIs, either during public sharing within systems like the Transkribus infrastructure or through external platforms such as eScriptorium, facilitated by uploads to repositories like Zenodo. Another desirable integration would be with ORCID, to be unambiguous about the creator(s) of an ATR model. At the risk of further complicating the issue, we would also advise mentioning the programmer of the training and evaluation algorithms (the text recognition engines).
A further layer that keeps coming up is the quality of a model, expressed as a Character Error Rate (CER) and the number of tokens it is based upon. Both the CER of the training set and that of the validation set, as well as their respective sizes, are informative for judging the quality of an ATR model and its tendency to overfit [Hodel, 2020].
To further complicate the situation, new models can be developed using existing ones as a foundational 'base model', a process known in machine learning as 'fine-tuning'. Base models can also be stacked while creating the ideal model. In principle, the entire stack of base models preceding any new model should be referenced.
As mentioned above, Zotero supports an 'Item Type' called 'Software'. However, in disciplines such as the computational sciences and machine learning, such a generic designation falls short of describing the diverse digital objects that may currently be produced in any scientific domain, and it is, in any case, insufficient to cover ATR models. Congruent with what has been said about metadata and data sets, we need a rather granular schema for describing ATR models.
Mitchell et al. propose a 'model card' to record sufficient metadata and context about a model [Mitchell et al., 2019]. Such model cards have been implemented in the Hugging Face repository, the current go-to repository for publishing data sets, models, and documentation for NLP models used in AI technologies. Metadata fields include a model description, intended use, a how-to for application, limitations and bias, a description of the training data and procedure, evaluation methods and results, and a suggestion for how to cite the model.15 ATR models, which essentially merge character-based language modelling with computer vision techniques, closely resemble the language models available on Hugging Face. The same model card metadata scheme would therefore be a good fit and a solution to inform users of bias and editorial decisions. This would also allow communities to strive for a better understanding of the different practices for preparing and curating data sets.
As for citing models, we suggest the same approach as for data sets in the previous section. ATR models should be cited on the same basis as any other research product [Ball and Duke, 2015]. Consequently, the metadata fields to include for ATR model citation are congruent with those used for data set citation: author, publication date, title, version, resource type, publisher, identifier, and location. Right now, data is, unfortunately, often put on the Web without any of this information being present.

Ethics and Limitations of Sharing
Those sharing data must be aware of the ethical implications of doing so and how to handle them. These can concern economic or societal aspects, or relate to personality rights, among other things.16 Questions include, but are not limited to: Does the sharing contribute to the sharer's subsistence? Who can contribute more to society by having (some) control over the data, e.g. by improving an ATR platform? For how long should the data of the people in the documents be protected? In this section, we briefly venture into these aspects of sharing Ground Truth and ATR models to indicate various points of view, without siding with any of them.
Without going deep into discussions about the business models of services and platforms, different trajectories to guarantee sustainability can be taken. READ-COOP SCE does 'share as much as possible, and retain as much control as necessary' to sustain its business and maintain its infrastructure. eScriptorium (as a second example) provides its software as open source, but offers no or only limited server space and power to train and use models. In both approaches, the sharing of Ground Truth and recognised text is foreseen and possible, allowing users to switch between systems and making vital data available.17 From an ethical rather than legal point of view, it is crucial to think about the creators, curators, and descendants of the people who created the material in question, which is the focus of the third section of this article. Especially when working with historical materials originating from colonial contexts, one must consider the biography of a document and describe how it became part of an institution, as well as consider the possible consequences of making documents or sources publicly available [e.g. Ortolja-Baird and Nyhan, 2022]. Non-Western communities may have very different models and understandings of ownership and of what it means to respect the content of historical documents. Thus, the consequences of working with and sharing data must be kept in mind. For this reason, in addition to the FAIR principles, the CARE principles, proposed by the Global Indigenous Data Alliance, need to be considered, since they cover a multitude of aspects. CARE does not have the same standing as FAIR for the moment, but it brings ethics into the discussion as a key aspect: it asks for the collective benefit of data production and sharing, and it demands that communities keep the authority to control 'their' data, while all players act responsibly.18 In short, CARE stands for 'Collective benefit', 'Authority to control', 'Responsibility', and 'Ethics', making us aware of the necessity to think about people and cultures that are being treated as data and to give those affected a voice [Carroll et al., 2020].
An example from the NIOD Institute for War, Holocaust, and Genocide Studies illustrates the challenges that sources can bring to the surface. In the ATR-based digitisation project titled 'First-Hand Accounts of War: War Letters (1935-1950) from NIOD Digitised', several challenges emerge. One key issue is the traceable personal information in these letters, which, if published, would violate the General Data Protection Regulation (GDPR). Additionally, ethical concerns stem from the potential impact that disclosing such information might have on relatives or third parties involved. This is further complicated by past agreements with donors who imposed restrictions on their archives, and by the possible application of the author's rights to the original texts [Keijzer et al., 2022]. By considering ethics as an important part of data publication, CARE has partially been accounted for in this case.
To take just one example, Dutch legislation has not specified in detail how to deal with these issues. The community of archival professionals has provided additional but informal guidelines. The 'Werkgroep AVG' (Workgroup GDPR) of the Royal Society of Archivists in the Netherlands (KVAN) illustrates how a data controller can comply with legal and ethical restrictions.19 The strategies relevant to the case at hand require anonymisation, pseudonymisation, data minimisation, a retention period and timely deletion, privacy 'by default', honouring the rights of those whom the data concerns, and information security.20 Legal and/or ethical restrictions do not necessarily imply the impossibility of sharing Ground Truth transcriptions or machine-generated transcriptions with a larger public. The strategies mentioned above show how customised approaches and technical and organisational measures can offer a solution to dealing with these restrictions.
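The pseudonymisation strategy mentioned above can be illustrated with a minimal sketch: a personal name is replaced by a stable token that can only be linked back to the individual with a separately kept key. All names and the key value here are illustrative assumptions, not part of any real project.

```python
import hashlib
import hmac

# Assumption: in practice this key lives in a secured key store, held
# separately from the published data set (pseudonymisation, not anonymisation).
SECRET_KEY = b"stored-separately"

def pseudonymise(name: str) -> str:
    """Replace a personal name with a stable token; re-identification
    is only possible for whoever holds the separately kept key."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"person-{digest[:12]}"

# A hypothetical record before publication:
record = {"transcriber": "Jane Doe", "pages_transcribed": 42}
safe_record = {**record, "transcriber": pseudonymise(record["transcriber"])}
```

Because the token is deterministic, the same contributor keeps the same token across a data set, which preserves the ability to count contributions without exposing names.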

IV ACKNOWLEDGING CONTRIBUTIONS
When we consider the proper acknowledgement of data sets and ATR models, we should not forget that their creation was a joint effort. As the Ground Truth and transcriptions that underlie ATR models are often supported by 'the crowd', volunteers, or citizen scientists, and digitisation is often the result of institutional activities, we would like to address issues that come up when acknowledging these contributions in this penultimate part.

Acknowledging the Crowd or Citizen Scientist
In an increasing number of digitisation projects, 'the crowd' is essential in generating Ground Truth data by transcribing, or correcting transcriptions, which are then used for the training of new ATR models. Acknowledging the crowd is important not only because of their hard work, but also to provide insight into how and with what resources Ground Truth data was produced. Properly citing the crowd contributes to a more transparent data production process. However, there are no clear standards yet for how this should be done. The following section deals with the question of how to acknowledge the crowd sustainably and fairly. We focus on the recognition and reward of the labour that has been poured into projects through the many hands of volunteers, and we look at the best practices of various projects and make new recommendations.

19 Working Group GDPR (Werkgroep AVG) of the Information and Archive Knowledge Network (Kennisnetwerk Informatie en Archief - KIA), "Weten of vergeten? Handreiking voor het toepassen van de Algemene verordening gegevensbescherming in samenhang met de Archiefwet in de dagelijkse praktijk van het informatiebeheer bij de overheid" [2020, 33-34]. See: https://kia.pleio.nl/attachment/entity/a8e1caa5-0d59-4267-bbc0-4cd288b2a56c.

20 The seven points of this strategy refer to the following. The first is Anonymisation: altering personal data to prevent identification of the individual, directly or indirectly. The second is Pseudonymisation: modifying data to allow identification only with additional 'key' information, kept separately for security. The third is Data Minimisation: storing only the essential personal data for the intended purpose, thereby reducing risk. The fourth is Retention Period and Timely Deletion: setting a fixed storage duration for personal data and ensuring its deletion after that period. The fifth is Privacy 'By Default': integrating privacy controls such as authorised access and monitoring directly into the system. The sixth is Honouring the Rights of Data Subjects: allowing individuals to view, edit, or delete their data, with exceptions handled through a balanced approach. The final point is Information Security: protecting data via risk analysis, classification, and audits to prevent unauthorised access or breaches.

Acknowledging the crowd: current situation and room for improvement
Using the existing landscape of crowdsourcing projects as examples, we find roughly two different methods of acknowledging volunteers. First, some projects refer to their volunteers in general, as if they were a homogeneous group (see figures 7 and 8). Some do so for practical reasons, others to intentionally emphasise the collective effort instead of the individual. Second, there are projects, especially smaller ones, that acknowledge their volunteers by listing them with their full credentials in recognition of their work (see figures 9 and 10). In our view, and in line with the previous sections of this article, these acknowledgements should also be incorporated into the publication of the actual resulting data sets. How should that be done? It is understandable that, due to the administrative labour involved, larger projects in particular tend to acknowledge their volunteers in a more generalised manner, but there are also arguments in favour of listing members of the crowd as individuals in the case of Ground Truth publication. We want to provide three such arguments. First, choosing to name individuals is a more personal acknowledgement of their pivotal role in the data production process. Some volunteers appreciate being named for their efforts, and listing specific names gives credit to those who deserve it. Second, acknowledgement by name in the case of a published data set can also serve as a certificate of participation for members of the crowd. Participants can then list the data set as a publication in their CVs, which allows them to demonstrate their digital skills. These skills are especially important considering that humanities students, interns, and young programmers make up part of the crowd in many projects. Third, acknowledging individuals as contributors to a data set provides transparency to (future) users on how and by whom it was created (see also section 4.1.2).
Experience teaches that in many crowdsourcing projects, a small group of individuals contributes the majority of the work. Additionally, there is often a somewhat larger group of individuals who contribute regularly. Many of the volunteers, however, only make a limited contribution, after which they quit, or never actually start the work at all. In these cases, one could consider naming only the volunteers who have exceeded a specific threshold of work. A personalised recognition could also provide the space to list first the people who delivered most of the transcriptions, whereas those who made smaller contributions are placed last on the list. Alternatively, instead of ranking members of the crowd by their contributions, names could be attached to the individual documents, or even pages, they transcribed. As such, credit is not only given to the person who produced the data, but insight is also provided into the quality of individual transcribers' contributions.
While the above certainly provides future users with more transparency in the data curation process, it is essential to keep in mind that crowdsourcing projects departed from the idea that every contribution is welcome and valued. Many volunteers who start a new project are insecure about their palaeography skills, and not every participant can contribute substantial work due to personal circumstances. One should thus be cautious about ranking, as this could be considered a (dis)qualification of volunteers' efforts. If done at all, ranking volunteers or attaching individual transcribers' names to their specific contributions should be done in a motivating and engaging way. If a positive outcome of ranking is uncertain, it is advisable to list the names alphabetically.
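The threshold-based listing discussed above, with alphabetical order as the cautious default and ranking only as an explicit choice, could be sketched as follows. The volunteer names, contribution counts, and threshold value are all hypothetical.

```python
# Hypothetical contribution counts per volunteer (assumption: a real project
# would export these figures from its transcription platform).
contributions = {"Z. Aalders": 812, "A. Smit": 301, "M. de Vries": 12, "D. Bakker": 3}

THRESHOLD = 10  # assumed cut-off: name only volunteers above this page count

def credit_list(counts, threshold=THRESHOLD, ranked=False):
    """Return the names to acknowledge. Alphabetical by default, since
    ranking may be perceived as a (dis)qualification of volunteers' efforts."""
    named = [name for name, count in counts.items() if count >= threshold]
    key = (lambda n: -counts[n]) if ranked else str.lower
    return sorted(named, key=key)
```

The `ranked` flag is kept opt-in on purpose, mirroring the advice above to default to alphabetical listing when a positive outcome of ranking is uncertain.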

GDPR issues: opt-in or opt-out?
While listing individual citizen scientists is something to consider, there are some hurdles to take into account when publishing such a list. According to the European Union's General Data Protection Regulation (GDPR), a person's name is personal data. Consequently, when listing the names of individual contributors, those people should be informed, and consent for using their names needs to be sought.
Future complications could be avoided by presenting the citizen scientists with a digital form asking them to check a box if they agree to be named in a publication before they apply to the project. Thus, they can knowingly opt in. It is crucial that such a form clearly states how exactly their name would be used, as part of expectation management, if the participant allows for their name to be used at all. Under what conditions are names listed? Should a certain threshold have been met before a person is acknowledged? Are the names in alphabetical order, ranked, and/or even connected to the individual output? The form should also provide information on how personal information is stored and kept safe.

Figure 9: Part of the project website of Pardons; here, the names of all volunteers are listed (https://pardons.eu/the-team/). [31-10-2022]
However, one can imagine that, especially for larger projects which have already started, asking every individual member of the crowd for their consent can result in an administrative nightmare. For these cases, there is an 'opt-out' method to deal with the GDPR. Opt-out refers to a situation in which people are presented with the statement that data will be published with their names unless they themselves reach out to a specified person within a specific, reasonable time frame and express their wish to be excluded. It is sufficient for projects to send the opt-out option once, as this serves as proof for the initiative. One should be aware, though, that this method is riskier than using an opt-in, especially when many participants in a project are no longer active. If people miss the opportunity to opt out (due to changed contact details, for example) and specifically do not want to be mentioned by name, this could lead to discontent.
For both the opt-in and opt-out approaches, volunteers and their heirs should retain the option to withdraw their names at a later point in time, and information about how they can do so should be available. When someone requests the withdrawal of their name, the name can no longer be used for future publications. However, the GDPR also allows for a request for data erasure. In these cases, the name should, if reasonably possible, also be removed from past publications. When doing so, it should be asked whether deleting the name prevents the achievement of the goals of the publication and/or research.
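The consent bookkeeping described in this section, opting in, and later withdrawal, could be recorded in a minimal structure like the one below. The fields are illustrative assumptions; a real project would need to align them with its own GDPR documentation and retention policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    """Minimal sketch of a volunteer's naming-consent record (hypothetical)."""
    name: str
    opted_in: bool
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    withdrawn_at: Optional[datetime] = None

    def withdraw(self) -> None:
        # After withdrawal the name may no longer be used in future publications.
        self.withdrawn_at = datetime.now(timezone.utc)

    @property
    def may_be_named(self) -> bool:
        return self.opted_in and self.withdrawn_at is None
```

Keeping the withdrawal timestamp, rather than deleting the record outright, gives the project the audit trail it needs while still excluding the name from any future acknowledgement list.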
As shown, acknowledging the people involved is not a simple task - in some respects an almost impossible one - and it requires action on many levels. A feasible and widespread approach to acknowledgement has been provided within the frame of the CRediT taxonomy [Allen et al., 2014]. Some journals, such as Science, already work with this model and add an acknowledgement section to their articles [Kestemont et al., 2022]. The CRediT website states that it '[...] grew from a practical realisation that bibliographic conventions for describing and listing authors on scholarly outputs are increasingly outdated and fail to represent the range of contributions that researchers make to published output. Furthermore, there is growing interest among researchers, funding agencies, academic institutions, editors, and publishers in increasing both the transparency and accessibility of research contributions.'21 The taxonomy currently lists fourteen different roles contributors could have, as indicated in figure 11 below.
While CRediT might look complicated, work on Ground Truth, data sets, or databases generally fits within the frame of 'data curation' or 'resources'. Being explicit about a person's role will not only help avoid confusion about their contribution, but also demonstrate the different kinds of contributions. When citizen scientists/volunteers are provided with a specific task (e.g. transcribing, correcting, or tagging texts), it could immediately be connected to one of the CRediT roles, such as data curation or resources. Regardless of their initial role, if citizen scientists come across an exciting find that leads to specific research, an additional role could be assigned in consultation with the individual. From a legal perspective, one's role relates to one's potential author's rights.22 What makes the situation almost impossible to solve is the case in which someone decides to withdraw their own name. In these circumstances, an already reused data set will probably not alter the acknowledgement post-print. Thus, in the case of later use, one must check the original publication to make sure that no one is mentioned who does not wish to be mentioned.

Figure 11: CRediT taxonomy. https://credit.niso.org/
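Connecting volunteer tasks to CRediT roles, while honouring naming consent, could look like the sketch below. The role labels 'Data curation', 'Resources', and 'Investigation' are among the fourteen CRediT roles; the contributor names and assignments are hypothetical.

```python
# A few of the fourteen CRediT contributor roles (https://credit.niso.org/).
CREDIT_ROLES = {
    "data_curation": "Data curation",
    "resources": "Resources",
    "investigation": "Investigation",
}

# Hypothetical contributors; 'consented' reflects the opt-in discussed earlier.
contributors = [
    {"name": "A. Volunteer", "roles": ["data_curation"], "consented": True},
    {"name": "B. Volunteer", "roles": ["data_curation", "investigation"], "consented": False},
]

def acknowledgement_lines(people):
    """One CRediT-style acknowledgement line per consenting contributor."""
    return [
        f"{p['name']}: " + ", ".join(CREDIT_ROLES[r] for r in p["roles"])
        for p in people
        if p["consented"]
    ]
```

Filtering on consent inside the acknowledgement generator keeps the published list and the consent records from drifting apart.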

Acknowledging Institutional Activities: Digitisation Activity and Contextualisation
GLAM-sector institutions, but of course also private institutions, digitise their collections. Digitisation is a time-consuming and costly process that is, by now, part of their core business.23 It takes time, and this steadily paced process is only occasionally communicated to the outside world. From the researcher's perspective, communicating the relationship between the current version of the online collection and the offline archive is of great use, as it will support critical reflection on the possible methodological implications of the choices made in the digitisation process. Alternatively, a document or video explaining how subject categories, search fields, or filtering options were conceptualised can help clarify the (in)completeness of the online collection. This document or video could provide crucial details contributing to researchers' understanding of data provenance and archive structure and design.

Reflections, exports, and clarifying documentation
Researchers who regularly use digital resources have developed a critical perspective on collections and their provenance from archives, covering questions such as the selection of digitised data, physical aspects, and others, as shown in figure 12 below.
The questions above are essential for researchers to perform a conceptual translation from the physical object to the digital collection, which is more than the inventory number in its context of origin (the archivists' concept of the word provenance). Adjusting to the new digital world requires technical skills and resources to set up an infrastructure that integrates the characteristics archives are intended to guarantee: authenticity, reliability, integrity, and usability.24 This also raises the question of what has been digitised by a particular institution so far. An overview of what has been digitised should be available on the websites of GLAM institutions that digitise. Hauswedell et al. suggest that the institutional choices that went into selecting items for digitisation should be made clear to users [Hauswedell et al., 2020]. Jensen suggests that digital archives could be encouraged to demonstrate the extent and content of their digitisation efforts [Jensen, 2021, 256]. Here, she implicitly refers to the reliability of the found digitised document - how much of the inventory has been digitised (as a percentage; see e.g. fig. 13) - but also to what type of datafication has been applied: has the entire text been described, or merely names and places? Is transcription ongoing (meaning that searches could give a different result if taking place days or weeks later)? If additional data has been created, those involved in that process should have the opportunity to be acknowledged, even if this is 'just' part of their job. Such tasks could be considered the modern equivalent of assembling or describing an archive, which is the traditional role of archivists [Jensen, 2021, 258]. Though archivists are rarely credited for this work as individuals, the question is whether it would be helpful for both archivists and scholars to be named when part of digital projects, in a similar way to people who work on digital projects in academia. Having a credits list or page would give
workers in an increasingly precarious labour market a way to highlight their skills and experience (and be cited for it), make digital labour more visible, and let people who use the resources know who to contact if they have any questions related to the resources.
Combining the additional data with descriptions based on predefined categories and structures could allow for different search methods and so extend users' freedom. It would create multiple entry points that allow for differences and similarities between the conceptual models found in the archive and researchers' (changing) conceptual models [Jensen, 2021, 257]. Such room to manoeuvre supports open and divergent interpretations without the overt influence of the creators of such conceptual models. According to Jensen, this would or could result in different searches, including ones targeting a range of related topics or production contexts. At the same time, she highlights problems of bias, the historicity of language, and standardisation that can cause problems for future historians [Jensen, 2021, 258-9].
A final concern voiced by Jensen is that '[d]igitisation of archives depends on (additional) external funding, which means that they are likely to be subject to policies that emphasise popularity, marketisation, or current research trends' [Jensen, 2021, 258-9]. This concern could go two ways. On the one hand, one could argue that a selection bias based on the interests of funding individuals or institutions has been, and still is, also a problem of analogue archives. In other words, traditional archives require funding too, and the ones paying for them will necessarily influence the archive's contents. One could spin this thought out further and ask where the intentional omission of information starts (and where it will end). On the other hand, it has been argued that the digitisation of archives reduces selection bias. Based on experience from small- and large-scale digitisation projects and the literature [Jensen, 2021, 258-9], we agree that it limits selection bias, noting in particular political and infrastructural decisions. Digitisation is often a combination of a selection made by institutions and requests made by users (scanning on demand or asking for better searchability of a digitised source), but it also depends on the availability of equipment and the (financial) means to carry out such work and make it accessible. In addition, whether digitisation leads to increased information transparency - due to less selection bias - is up for discussion. For researchers with broad knowledge about an institution's collections, we nonetheless assume that educated conclusions about selection bias can be derived. Furthermore, the existence of certain materials online can also lead to more interest in certain documents or objects among the general public. By properly referencing resources, GLAM institutions can demonstrate the impact of their work, which may result in additional funding for digitising more resources.

Digital images as proper objects
While digitised copies are distinct intellectual products from the analogue materials, one should also be aware of possible discrepancies between digital and analogue versions, e.g. pages accidentally or intentionally not digitised, and (more or less) deliberate decisions on colouring and lighting, all leading to specific representations of objects that require critical approaches [Cordell, 2022]. To differentiate between digital facsimiles and their physical objects, digitising institutions should provide explicit guidelines for how they want their digitised facsimiles to be referenced [Rueda et al., 2017].25 Independent of the scale of document digitisation, issues arise when indicating differences between the physical and the digital object. In most cases, non-persistent identifiers are used, referring to a URL that is tied to the technology used or the database system. This creates the risk of providing a link that is dead or, potentially worse, will refer to another object in the future. Jensen, in the above-mentioned piece, remarks that historians rarely disclose whether they accessed a physical or digitised version of their sources, making us aware of the need to discuss digital archives as part of GLAM institutions [Jensen, 2021, 260].
While the content of the text might still be the same as in the physical object, clouding the understanding of why a different way of citing is needed, the digital form is not the same. This could have consequences for research focusing on materiality, as specific information (e.g. watermarks) can only be seen in the physical version, supported by specific infrastructure, and cannot be seen at all, or can only be seen in a sub-optimal or skewed way, in the digitised version. Nevertheless, the obvious advantages of a digital version need to be brought forward: enrichment of the data (e.g. in the form of Linked Open Data) can only be provided in a datafied version and not adequately in the physical object.
The digital turn in the humanities thus requires researchers to become more aware of their data's source and its materiality than ever before. The documentation of a method, including digital paths (proper PURL citations), is the reasonable course of action, and the only future-oriented one.26 While the International Image Interoperability Framework (IIIF) is of immense help for reusing images, the manifests used for this purpose are in themselves not enough to provide sustainability, since they can be changed at any time, and so do not provide the stability academic users seek [Padfield et al., 2022]. Furthermore, several GLAM institutions even offer references within their digitised resources.27 It is thus strongly recommended that the entire GLAM sector becomes more aware of its crucial role in providing proper provenance data for digitised objects. While their core business with regard to physical objects is to store and preserve [Featherstone, 2006], the preservation of digital derivatives should - in our opinion - follow the same principles: authenticity, reliability, integrity, and usability.28 Through persistent identifiers, the GLAM sector could already guarantee authenticity and usability. At the same time, the reliability factor is partly met, but depends on integrity, which relies on the 'coherent picture'.
For clarity, the International Standard Identifier for Libraries and Related Organisations (ISIL) could, and perhaps should, be integrated with a persistent identifier, adding information concerning the responsible institution.29 This information could function as an 'authority label', guaranteeing authority and reliability. If that were to be used, the structure of the filenames could be as shown in figure 14. Available transcriptions could follow the same structure but with a different extension, perhaps followed by a number indicating a version. Under extreme circumstances, the above could also indicate whether volunteers or researchers made a (less perfect) digital facsimile, as opposed to the official digitisation, which could potentially be helpful for GLAM institutions under threat or suffering damage. If and where possible, such a structure could be used to provide such versioned images within an IIIF manifest.30
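One way to read the proposed structure is as a simple naming convention: ISIL code, then persistent identifier, then an optional version number, then the file extension. Since figure 14 is not reproduced here, the exact field order and separators below are assumptions, and the example identifiers are purely illustrative.

```python
def facsimile_filename(isil, persistent_id, extension, version=None):
    """Sketch of a filename combining an ISIL 'authority label' with a
    persistent object identifier (assumed field order and '_' separator)."""
    name = f"{isil}_{persistent_id}"
    if version is not None:
        # Versioned transcriptions or re-digitised facsimiles get a suffix.
        name += f"_v{version}"
    return f"{name}.{extension}"

# Illustrative usage with a made-up persistent identifier:
image_name = facsimile_filename("NL-HaNA", "1.04.02-0001", "jpg")
transcription_name = facsimile_filename("NL-HaNA", "1.04.02-0001", "xml", version=2)
```

Because the ISIL prefix is stable per institution, such names remain interpretable even when files circulate outside the original repository, which is precisely the 'authority label' function suggested above.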

V CONCLUSION AND RECOMMENDATIONS
We started our contribution by discussing the export and sharing of Ground Truth. However, with sharing comes caring: properly acknowledging who provided the data or models and who contributed to their creation. We have discussed the HTR-United initiative and shown how one can register available data sets on this platform. This platform functions as an 'umbrella' solution, allowing contributors to use decentralised storage for their sources. At HTR-United, creators can be listed and metadata can be imported into Zotero for proper referencing.
Additionally, we addressed the challenges that subsequently emerged: the most effective ways to reference the digitised sources used, which currently rely heavily on the author's provision of precise annotations. Referring to a website, however, is not enough; we have indicated the need for persistent identifiers as well. A persistent identifier distinguishes the digitised collection from the physical objects and, more importantly, preserves the main characteristics of archival guarantees: authenticity, reliability, integrity, and (re)usability.31 Proper referencing of data sets and ATR models requires an overview of not only the underlying sources but also adequate acknowledgement of contributors. In addition, in the case of ATR models, information about the quality and the processing of both the training and validation sets should be provided. As this additional data is of great importance to future users, we propose working with a 'model card' to provide sufficient metadata for, and contextualisation of, a model. To describe the role of contributors and distinguish the various roles they could have, this article has suggested CRediT (Contributor Roles Taxonomy), which allows researchers and projects to reference the work of volunteers/citizen scientists properly, if they agree to be mentioned. Although this is only one example of how machine learning is being rolled out in the humanities, and in parallel in the library and archive community, the ongoing discussions demonstrate that we are only beginning to understand how best to share data, and how to recognise contributions to the shared data sets that underpin the artificial intelligence systems used in heritage contexts. We hope that this provides an example that can encourage others to consider these aspects within their infrastructures.
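The proposed 'model card' for an ATR model could, as a minimal sketch, carry the metadata discussed in this article: the underlying sources, training and validation quality, contributor roles, and licensing. Every field name and value below is an illustrative assumption, not a fixed schema or a real model.

```python
# Minimal model-card sketch for a hypothetical ATR model; all values are
# placeholders to show which kinds of metadata the card should carry.
model_card = {
    "model_name": "example-htr-model",
    "description": "ATR model for 18th-century Dutch handwriting (hypothetical).",
    "training_data": {
        "ground_truth": "persistent identifier (e.g. DOI) of the data set",
        "pages": 1200,
    },
    "validation": {
        "pages": 150,
        "character_error_rate": 0.054,  # illustrative figure
    },
    "contributors": [
        {"name": "A. Volunteer", "credit_role": "Data curation", "consented": True},
    ],
    "licence": "CC BY 4.0",
}

REQUIRED_FIELDS = {"model_name", "training_data", "validation", "contributors", "licence"}

def is_complete(card):
    """Check that a card documents sources, quality, contributors, and licence."""
    return REQUIRED_FIELDS <= card.keys()
```

Serialised as JSON or YAML alongside the published model, such a card makes the provenance chain, from Ground Truth identifier to acknowledged contributors, machine-readable for future users.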

Figure 6 :
Figure 6: The example of the Ground Truth publication is of particular interest because it results from a multi-stage process and demonstrates reuse of data. See also footnote 8.

Figure 13 :
Figure 13: Section from the White House Central Files (WHCF) created or collected by President Lyndon B. Johnson and his staff, with an indication of how much of the collection has been digitised. https://www.discoverlbj.org/exhibits/show/loh/pres/whcf