Signing data citations enables data verification and citation persistence

Commonly used data citation practices rely on unverifiable retrieval methods that are susceptible to content drift, which occurs when the data associated with an identifier are allowed to change. Based on our earlier work on reliable dataset identifiers, we propose signed citations, i.e., customary data citations extended to include a standards-based, verifiable, unique, and fixed-length digital content signature. We show that content signatures enable independent verification of the cited content and can improve the persistence of the citation. Because content signatures are location- and storage-medium-agnostic, cited data can be copied to new locations to ensure their persistence across current and future storage media and data networks. As a result, content signatures can be leveraged to help scalably store, locate, access, and independently verify content across new and existing data infrastructures. Content signatures can also be embedded inside content to create robust, distributed knowledge graphs that can be cited using a single signed citation. We describe applications of signed citations to solve real-world data collection, identification, and citation challenges.

expensive to preserve older versions as new versions are published. Alternatively, each image may be assigned its own DOI, requiring users to cite images individually or to cite a collection description that in turn cites each image by its DOI. This case demands a considerable amount of effort and commitment beyond the duration of the project to maintain active links (e.g., URLs) from each DOI to its associated image.
Instead, we make the Big-Bee corpus citable by publishing a registry of content signatures for images and associated metadata (including the locations of the images), such that a signed citation of the registry (tens of megabytes) recursively and robustly cites the entire Big-Bee image collection (potentially hundreds of gigabytes). After accessing the registry, users can choose to locate and retrieve only the subset of images they are interested in. When new images become available or change locations, the registry can be updated by publishing a new version that cites the previous version and appends the new information. Thus, snapshots of the Big-Bee collection may be cited and subsequently retrieved at any time without requiring any further commitment by the Big-Bee participants to maintain earlier versions of the collection (note that successive versions only add new images and do not remove existing ones). Additionally, when attempting to retrieve cited images, future versions of the registry (or even other independent registries) may be consulted to locate images that have changed locations but retained their unique content signatures.
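The append-and-cite versioning scheme described above can be sketched in a few lines of Python. The registry layout, file contents, and placeholder hashes below are hypothetical stand-ins for illustration, not the actual Big-Bee registry format:

```python
import hashlib

def content_signature(data: bytes) -> str:
    """Return a hash:// content signature (SHA-256) for raw bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

# Version 1 of a hypothetical registry: content signatures paired with
# known locations (the "aaaa..." / "bbbb..." hashes are placeholders).
registry_v1 = b"hash://sha256/aaaa... https://example.org/images/bee-001.jpg\n"
v1_signature = content_signature(registry_v1)

# Version 2 appends new entries and cites version 1 by its signature,
# so a signed citation of version 2 recursively cites version 1 as well.
registry_v2 = (
    b"previousVersion " + v1_signature.encode() + b"\n"
    + registry_v1
    + b"hash://sha256/bbbb... https://example.org/images/bee-002.jpg\n"
)
v2_signature = content_signature(registry_v2)
```

Because version 2 contains version 1 verbatim along with its signature, citing only v2_signature suffices to verifiably reference the entire version history.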
Below, two versions of the Big-Bee image corpus are cited using a single signature. Because the second version cites, contains, and extends the first version, only the content signature of the second version needs to be included in the citation. At least three copies of the corpus are kept at the Internet Archive, Zenodo, and a collection management system at the University of California, Santa Barbara's Cheadle Center for Biodiversity and Ecological Restoration (UCSB CCBER).

2. Versioning, citing, and retrieving biodiversity datasets registered with GBIF, iDigBio, and BioCASe

Recursive signed citations were used to construct the Biodiversity Dataset Archive (Poelen & Elliott, 2021). The archive contains openly available biodiversity datasets from institutions around the world as listed by dataset registries hosted by GBIF (https://gbif.org), iDigBio (https://idigbio.org), and BioCASe (https://biocase.org). The archive also contains machine-readable provenance logs (generated using the Preston tool, much like example 3) that describe the discovery of registered biodiversity dataset URLs, attempts to retrieve datasets from those URLs, and the content signatures of all collected datasets and provenance logs. The information in the provenance logs forms a graph consisting of links between URLs, datasets, automated discovery and collection processes, and intermediate data collected by those processes (e.g., the dataset registries hosted by GBIF, iDigBio, and BioCASe), where all data are referenced using their content signatures. Consequently, all digital resources (datasets, provenance logs, and intermediate outputs of automated processes) in the graph are directly or indirectly referenced by the latest provenance log, such that a citation (e.g., example 3) of the latest provenance log recursively cites the entire graph and everything referenced by it. The graph was analyzed in Elliott et al. (2020) to collect statistics on how often the availability and content of biodiversity dataset URLs change over time, as indicated by changes in the content signatures associated with each URL in response to monthly download requests.
The citation of this large and complex data graph, containing billions of versioned biodiversity records published independently across hundreds of institutions, is similar to citing a single image. Below is a signed citation of a graph of biodiversity datasets registered with GBIF, iDigBio, and BioCASe. This is the 50th version of an evolving graph; it cites, contains, and extends 49 earlier versions. Rather than computing a hash of the entire multi-terabyte archive, we only need to compute the hash of the latest provenance log, which recursively and verifiably cites all other data in the graph by their SHA-256 content signatures.

Poelen, J.H., Elliott, M.J. (2020). Biodiversity Dataset Archive. hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b Accessed at https://archive.org/details/biodiversity-dataset-archives

The exact version of the graph referenced in the citation can be located by entering the content signature into the search bar at zenodo.org (note that we had to indicate to Zenodo that this specific SHA-256 hash should be included in their search index). The graph (distributed across several provenance datasets) and the content it references can then be downloaded manually or using an automated tool such as Preston (Poelen et al., 2023). Once downloaded, the signatures of the data in the archive (as described in the provenance logs, starting with the one that was cited) may be reproduced using the SHA-256 algorithm to verify that the correct data have been retrieved.
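The final verification step can be sketched as a short Python illustration. The content here is a toy stand-in, not the actual archive; the point is that verification needs nothing beyond the retrieved bytes and the cited signature:

```python
import hashlib

def verify(content: bytes, cited_signature: str) -> bool:
    """Check retrieved bytes against a cited hash://sha256/ signature."""
    expected = cited_signature.removeprefix("hash://sha256/")
    return hashlib.sha256(content).hexdigest() == expected

# Toy content standing in for a downloaded provenance log.
log = b"example provenance log\n"
citation = "hash://sha256/" + hashlib.sha256(log).hexdigest()

verify(log, citation)            # True: content matches the citation
verify(b"tampered\n", citation)  # False: a mismatch exposes content drift
```

Note that the check is independent of where the content was retrieved from: any copy, from any storage medium, either reproduces the cited signature or does not.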

Stabilizing URL references using Nomer's Corpus of Taxonomic Resources
Existing compute infrastructures and data processing software often use URLs to reference digital data. For instance, Nomer (Poelen & Salim, 2022) is a tool to help interpret taxonomic names using existing taxonomic resources (e.g., NCBI Taxonomy (Federhen, 2012)). Nomer uses URLs to point to externally published taxonomic resources (e.g., https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz). Because the content accessed via these URLs may change or become unavailable over time, the results produced by Nomer may change over time. To help make Nomer's results reproducible over time, the Nomer Corpus of Taxonomic Resources was created (Poelen, 2022). By using principles of content addressing in the form of content signatures, this corpus associates locations of taxonomic resources with their resolved digital content at several different observation times. Nomer has been extended to look up previously resolved content associated with a location as defined in a specific version of the taxonomic resource corpus to ensure that Nomer's results remain reproducible, even as the availability of taxonomic resources at their original Internet locations changes over time. Nomer's Corpus of Taxonomic Resources is cited below. We expect that having a reliable method to interpret taxonomic names will help to better understand and quantify the impact of different taxonomic interpretations on research outcomes. For instance, by directing Nomer to use a mapping of URLs to content signatures at a specific point in time, we are able to reproduce Nomer's recognition of previously unknown differences in bee names between the specified versions of ITIS (ITIS, 2004) and the Discover Life bee species guide and world checklist (Ascher & Pickering, 2020).
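The lookup that stabilizes these URL references can be illustrated with a simplified sketch. The corpus structure, dates, and placeholder hashes below are hypothetical; the real Corpus of Taxonomic Resources is organized differently:

```python
# Hypothetical corpus: each URL maps observation times to the content
# signature that the URL resolved to at that time ("1111..."/"2222..."
# are placeholder hashes, not real signatures).
corpus = {
    "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz": {
        "2021-11-01": "hash://sha256/1111...",
        "2022-05-01": "hash://sha256/2222...",
    }
}

def pinned_signature(url: str, observed_at: str) -> str:
    """Look up the content signature a URL resolved to at a given time,
    so later processing retrieves exactly that version of the resource."""
    return corpus[url][observed_at]
```

By pinning the lookup to a chosen observation time, a name-matching run performed today can use exactly the same taxonomic content as a run performed years earlier, even if the URL now serves different bytes.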

Our use of Zenodo as a proof-of-concept content signature resolver
Our use of Zenodo for resolving MD5 and SHA-256 hashes to content made use of two unsupported features and should not be expected to keep working without official support. For resolving MD5 hashes, it was demonstrated in an online discussion on Zenodo's GitHub project page (https://github.com/zenodo/zenodo/issues/1985) that Zenodo's search API allows users to query for content that matches a supplied MD5 hash. The API can be used in this way by forming a URL following the pattern demonstrated in that discussion. Because the web API uses a predictable format, users and automated services can effectively (with the understanding that it is unsupported) use Zenodo as an MD5 resolver for any content that it serves.
We also used Zenodo as a resolver for SHA-256 hashes, but using a different approach. Because Zenodo indexes the contents of each publication's title and readme, any hashes (preferably with the algorithm specified, e.g., using "content signatures") included in those fields will be made findable by their search API. For example, the following query reliably returns a link to a dataset identified by the content signature hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55.

https://zenodo.org/search?q=450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55
Because the dataset's MD5 hash is also listed in the readme, the dataset can also be found by its MD5 hash using this method:

https://zenodo.org/search?q=898a9c02bedccaea5434ee4c6d64b7a2

In general, any online content repository with a web API that allows searching by hash, and that produces predictably formatted outputs, can also act as a content resolver, usually with the restriction that only hashes of content housed in that repository will be resolvable. In light of this, to help users resolve content, the Preston software simply queries several different repositories rather than requiring the user to specify which resolver (i.e., content signature registry) to use.
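Following the search URL pattern shown above, such a resolver client reduces to a one-line transformation from content signature to search URL. This is a minimal, unofficial sketch, given that Zenodo does not officially support this use:

```python
def zenodo_search_url(signature: str) -> str:
    """Form a Zenodo search URL from a hash:// content signature,
    using the bare hash as the q= query term, as shown above."""
    bare_hash = signature.rsplit("/", 1)[-1]
    return "https://zenodo.org/search?q=" + bare_hash

sig = ("hash://sha256/450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460"
       "df63dd6463d252bff37b55")
zenodo_search_url(sig)
# → "https://zenodo.org/search?q=450deb8ed9092ac9b2f0f31d3dcf4e2b9be003c460df63dd6463d252bff37b55"
```

A tool like Preston can apply the same transformation for each repository it knows about and try them in turn until one resolves the signature.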

Enhancing existing content discovery and retrieval methods with signed citations
Content discovery (searching for data) and content retrieval (getting the data) methods are needed to make use of available digital content. A typical content discovery and retrieval use case can be exemplified as follows:

1. Dr. Yasmine Abbas articulates her search request and sends it to José Martinez, a trusted library agent.
2. After receiving and processing Dr. Abbas' request, Martinez compiles a list of relevant scholarly publications.
3. Dr. Abbas retrieves the first listed digital resource from the institutional digital archive.
4. After using the digital resource in her work, Dr. Abbas includes a citation to the resource in her reference list.
This sequence of steps can be robustly executed, documented, and linked to the resulting data using signed citations as follows. In step 1, Dr. Abbas creates a search request and submits it to a library service. The search request is itself a small publication by Dr. Abbas that documents her interest. Let's say her request is captured in the following text document: Dr. Abbas requests references to digital images of the Bee genus Apis.
In step 2, Dr. Abbas receives the following result compiled by the library agent, Martinez:

In response to "Dr. Abbas requests references to digital images of the Bee genus Apis", José Martinez located the following images after consulting "A biodiversity dataset graph: https://jhpoelen.nl/bees. 2020. hash://sha256/85138e506a29fb73099fb050372d8a379794ab57fe4bfdf141743db0de2b985c"

along with a signed citation for the list:

List of references to digital images of the Bee genus Apis compiled by José Martinez in response to Dr. Abbas' request: hash://sha256/ca712a49029b4d956d1dbf4b32174ce2297d362118464b057d5d66afbc99767c

where the hash of Dr. Abbas' request is included to verifiably link Martinez's response to Dr. Abbas' request. Note that Martinez explicitly states which resource (a biodiversity dataset graph of bee-related records) was used to generate the list. If desired, Dr. Abbas can review and verify his results.
In step 3, Dr. Abbas sends another request to Martinez. After receiving the request, Martinez sends an exact digital copy of the requested content to Dr. Abbas. On receiving the digital content, Dr. Abbas successfully verifies the identity of the content by calculating its content hash and confirming that it matches the hash cited in her request. In step 4, Dr. Abbas uses the digital content in a publication and includes a reference to the content in her list of references. Now that we've described a content discovery and retrieval scenario, we can articulate the benefits of incorporating content signatures into discovery and retrieval workflows: signatures allow data and their provenance to be reliably described so that, if desired, Dr. Abbas can independently verify the claims made by Martinez. The use of content-based identifiers allows the discovery and retrieval method to be independent of communication protocol, location, and time:

1. Communication protocol independent: regardless of the communication method or storage medium, Dr. Abbas is able to verify the identity of the retrieved content by recalculating its signature. Also, Dr. Abbas is able to check whether Martinez correctly referenced her original request. In addition, Dr. Abbas can trace the provenance of the (possibly) subjective statements made by Martinez. In the absence of verifiable references such as content signatures, Dr. Abbas has to trust, but cannot verify, the authorship and provenance of Martinez's responses.
2. Location independent: the locations of Dr. Abbas, Martinez, and the referenced resources are irrelevant. Just as the location of a physical book does not change the content of the book, the contents of the verifiably identified requests and responses do not depend on where they happen to be created or stored.
3. Time independent: similarly, the validity of the search results and retrieved content does not depend on the time of retrieval, because the messages and relevant contents are identified using content-based signatures, and the result of the cryptographic function used to calculate the content hash does not depend on the time of the calculation.
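The verifiable request-response link in this scenario can be sketched in Python. The exchanged texts below are toy strings, and we assume UTF-8 encoding of the communicated content:

```python
import hashlib

def signature(text: str) -> str:
    """Content signature of a text document, assuming UTF-8 encoding."""
    return "hash://sha256/" + hashlib.sha256(text.encode("utf-8")).hexdigest()

request = "Dr. Abbas requests references to digital images of the Bee genus Apis."

# Martinez' response embeds the signature of the request it answers.
response = f"In response to {signature(request)}, the following images were located: ..."

# Dr. Abbas recomputes the signature of her own request and checks that
# the response verifiably refers to it, regardless of how, where, or when
# the response was delivered.
signature(request) in response  # True
```

Had the request text been altered anywhere along the way, the recomputed signature would no longer appear in the response, and the provenance link would fail to verify.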
Note that commonly used communication methods (e.g., HTTP, FTP, email) can still be used to implement discovery and retrieval methods that use content signatures. For example, requests and responses can be exchanged in HTTP messages, digital files, and electronic messages. Even analog communication methods (e.g., paper, spoken word) can be used, as long as a suitable, lossless encoding method is used to represent communicated content in a digital form and compute a content hash. Also note that Dr. Abbas and Martinez can represent human or machine actors.
Without content signatures, Dr. Abbas and Martinez have to trust, but cannot verify, the identities of the referenced (and retrieved) contents. Furthermore, without verifiable provenance, evaluating content's authenticity relies on subjective, often implicit assumptions of trust derived from, for example, reputation, personal experience, institutional affiliation, the domain name of the service, the delay between request and response, or the quality of the communication channel. Following our earlier discussion of the World Wide Web, we know that most, if not all, of these assumptions cannot be easily verified using identifiers (e.g., URLs) that are prone to link rot and content drift, especially over long periods of time.