Decentralized provenance-aware publishing with nanopublications

We argue that this architecture could be used as a low-level data publication layer to serve the Semantic Web in general. Our evaluation of the current network shows that this system is efficient and reliable.


Modern science increasingly depends on datasets, which are, however, left out of the classical way of publishing, i.e. through narrative (printed or online) articles in journals or conference proceedings. This means that the publications describing scientific findings become disconnected from the data they are based on, which can seriously impair the verifiability and reproducibility of their results. Addressing this issue raises a number of practical problems: How should one publish scientific datasets, and how can one refer to them in the respective scientific publications? How can we be sure that the data will remain available in the future, and how can we be sure that data we find on the web have not been corrupted or tampered with? Moreover, how can we refer to specific entries or subsets of large datasets, for instance, to support a specific argument or hypothesis?

Data repositories such as Figshare and Dryad have been established to address some of these problems. Furthermore, Digital Object Identifiers (DOI) have been advocated to be used not only for articles but also for scientific data (Paskin, 2005). While these approaches certainly improve the situation of scientific data, in particular when combined with Semantic Web techniques, they nevertheless have a number of drawbacks: They have centralized architectures, they give us no way to check whether the data have been (deliberately or accidentally) modified, and they do not support access or referencing at a more granular level than entire datasets (such as individual data entries). We argue that the centralized nature of existing data repositories is inconsistent with the decentralized manner in which science is typically performed, and that it has serious consequences with respect to reliability and trust. The organizations running these platforms might at some point go bankrupt, be acquired by investors who do not feel committed to the principles of science, or for other reasons become unable to keep their websites up and running. Even though the open licenses enforced by these data repositories will probably ensure that the datasets remain available at different places, there exist no standardized (i.e. automatable) procedures to find these alternative locations and to decide whether they are trustworthy.

Even if we put aside these worst-case scenarios, websites typically do not have perfect uptime and might be down for a few minutes or even hours every once in a while. This is certainly acceptable for most use cases involving a human user accessing data from these websites, but it can quickly become a problem in the case of automated access embedded in a larger service. Furthermore, it is possible that somebody gains access to the repository's database and silently modifies part of the data, or that the data get corrupted during the transfer from the server to the client. We can therefore never perfectly trust any data we get, which significantly complicates the work of scientists and impedes the potential of fully automatic analyses. Lastly, existing forms of data publishing mostly have only one level at which data are addressed and accessed: the level of entire datasets (sometimes split into a small number of tables). In these cases it is not possible to refer to individual data entries or subsets in a way that is standardized and retains the relevant metadata and provenance information. To illustrate this problem, let us assume that we conduct an analysis using, say, 1000 individual data entries from each of three very large datasets (containing, say, millions of data entries each). How can we now refer to exactly these 3000 entries to justify whatever conclusion we draw from them? The best thing we can currently do is to republish these 3000 data entries as a new dataset and to refer to the large datasets as their origin. Apart from the practical disadvantage of being forced to republish data just to refer to subsets of larger datasets, other scientists need to either (blindly) trust us or go through the tedious process of semi-automatically verifying that each of these entries indeed appears in one of the large datasets. Instead of republishing the data, we could also try to describe the used subsets, e.g.
in the form of SPARQL queries in the case of RDF data, but this does not make it less tedious, keeping in mind that older versions of datasets are typically not provided by public APIs such as SPARQL endpoints.

Below, we present an approach to tackle these problems, which builds upon existing Semantic Web technologies, in particular RDF and nanopublications, adheres to accepted web principles, such as decentralization and REST APIs, and supports the FAIR guiding principles of making scientific data Findable, Accessible, Interoperable, and Reusable (Wilkinson et al., 2016). Specifically, our research question is: Can we create a decentralized, reliable, and trustworthy system for publishing, retrieving, and archiving Linked Data in the form of sets of nanopublications based on existing web standards and infrastructure? It is important to note here that the word trustworthy has a broad meaning, and there are different kinds of trust involved when it comes to retrieving and using datasets from some third party. When exploring existing datasets, a certain kind of trust is needed to decide whether an encountered dataset is appropriate for the given purpose. A different kind of trust is needed to decide whether an obtained file correctly represents a specific version of a specific dataset that has been chosen to be used.

Interestingly, the concept of nanopublications has also been taken up in the humanities, namely in philosophy (http://emto-nanopub.referata.com/wiki/EMTO_Nanopub), musicology (Freedman, 2014), and history/archaeology (Golden and Shaw, 2016). A humanities dataset of facts is arguably more interpretive than a scientific dataset, relying, as it does, on the scholarly interpretation of primary sources. Because of this, "facts" in humanities datasets (such as prosopographies) have often been called "factoids" (Bradley, 2005), as they have to account for a degree of uncertainty. Nanopublications, with their support for granular context and provenance descriptions, offer a novel paradigm for publishing such factoids, by providing methods for representing metadata about responsibilities and by enabling discussions and revisions beyond any single humanities project.

Research Objects are an approach related to nanopublications, aiming to establish "self-contained units of knowledge" (Belhajjame et al., 2012), and they constitute in a sense the antipode of nanopublications. We could call them mega-publications, as they contain much more than a typical narrative publication, namely resources like input and output data, workflow definitions, log files, and presentation slides. We demonstrate in this paper, however, that bundling all resources of scientific studies in large packages is not necessary to ensure the availability of the involved resources and their robust interlinking: this can also be achieved with cryptographic identifiers and a decentralized architecture.

SPARQL is an important and popular technology to access and publish Linked Data: it is both a language to query RDF datasets (Harris and Seaborne, 2013) and a protocol to execute such queries on a remote server over HTTP (Feigenbaum et al., 2013). Servers that provide the SPARQL protocol, referred to as "SPARQL endpoints", are a technique for making Linked Data available on the web in a flexible manner. While off-the-shelf triple stores can nowadays handle billions of triples or more, they potentially require a significant amount of resources in the form of memory and processor time to execute queries, at least if the full expressive power of the SPARQL language is supported. A recent study found that more than half of the publicly accessible SPARQL endpoints are available less than 95% of the time (Buil-Aranda et al., 2013), posing a major problem to services depending on them, in particular to those that depend on several endpoints at the same time. To understand the consequences, imagine one has to program a mildly time-critical service that depends on RDF data from, say, ten different SPARQL endpoints: even if each of them were available 95% of the time, the probability that all ten are up at any given moment would be only 0.95^10 ≈ 60%. Moreover, SPARQL is an interface that causes heavy load in terms of memory and computing power on the side of the server.

Clients can request answers to very specific and complex queries they can freely define, all without paying a cent for the service. This contrasts with almost all other HTTP interfaces, in which the server imposes (in comparison to SPARQL) a highly limited interface, where the computational costs per request are minimal.

To solve these and other problems, more lightweight interfaces have been suggested, such as the read-write Linked Data Platform interface (Speicher et al., 2015) and the Triple Pattern Fragments interface (Verborgh et al., 2014), as well as infrastructures to implement them, such as CumulusRDF (Ladwig and Harth, 2011). These interfaces deliberately allow only less expressive requests, such that the maximal cost of each individual request can be bounded more tightly. More complex queries then need to be evaluated by clients, which decompose them into simpler subqueries that the interface supports (Verborgh et al., 2014). While this constitutes a scalability improvement (at the cost of, for instance, slower queries), it does not necessarily lead to perfect uptime, as servers can be down for other reasons than excessive workload. We propose here to go one step further by relying on a decentralized network and by supporting only simple key-based lookup of individual nanopublications. Such lookup-based access is more limited than full query interfaces (Hartig, 2013), but this is not a problem with the multi-layer architecture we propose below, because querying is only performed at a higher level where these limitations do not apply.

The approach that we present below is based on previous work, in which we proposed trusty URIs to make nanopublications and their entire reference trees verifiable and immutable by the use of cryptographic hash values (Kuhn and Dumontier, 2014, 2015). A trusty URI ends with an artifact code: a 45-character string consisting of a two-character module identifier followed by a Base64-encoded cryptographic hash computed over the identified content, so that anybody retrieving the content can verify that it has not been modified.

Furthermore, we argued in previous work that the assertion of a nanopublication need not be fully formalized, but we can allow for informal or underspecified assertions (Kuhn et al., 2013), to deal with the fact that the creation of accurate semantic representations can be too challenging or too time-consuming for many scenarios and types of users. This is particularly the case for domains that lack ontologies and standardized terminologies with sufficient coverage. These structured but informal statements are supposed to provide a middle ground for situations where fully formal statements are not feasible. We proposed a controlled natural language (Kuhn, 2014) for these informal statements, which we called AIDA (standing for the introduced restriction on English sentences to be atomic, independent, declarative, and absolute), and we had shown before that controlled natural language can also serve in the fully formalized case as a user-friendly syntax for representing scientific facts (Kuhn et al., 2006). We also sketched how "science bots" could autonomously produce and publish nanopublications, and how algorithms could thereby be tightly linked to their generated data (Kuhn, 2015b), which requires the existence of a reliable and trustworthy publishing system, such as the one we present here.

Figure 1. Illustration of current architectures of Semantic Web applications and our proposed approach (current architectures: servers such as Triple Pattern Fragments servers provide, find, and query data; proposed architecture: a nanopublication server network provides data, core services find data, advanced services query and analyze data, and applications analyze and use data).

Our architecture relies on the assumptions (1) that nanopublications are replicated on enough independent servers that the probability that they all happen to be inaccessible at the same time is negligible, and that (2) these representations are reasonably small, so that downloading them is a matter of fractions of a second, and so that one has to process only a reasonable amount of data to decide which links to follow.
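Both assumptions hinge on trusty URIs making content verification cheap. As a minimal illustration of that check (a sketch only: the actual trusty URI modules also define an RDF graph normalization step before hashing; the module identifier FA for byte-level content and the URL-safe Base64 alphabet follow the trusty URI specification, and the example URI is hypothetical):

```python
import base64
import hashlib

def artifact_code(content: bytes) -> str:
    """Artifact code for raw byte content: module id 'FA' plus the
    Base64-encoded SHA-256 hash (43 characters, padding dropped)."""
    digest = hashlib.sha256(content).digest()
    return "FA" + base64.urlsafe_b64encode(digest).decode().rstrip("=")

def verify(trusty_uri: str, content: bytes) -> bool:
    """The last 45 characters of a trusty URI are its artifact code."""
    return trusty_uri[-45:] == artifact_code(content)

# Hypothetical example URI:
uri = "http://example.org/data.txt." + artifact_code(b"some data")
assert verify(uri, b"some data")
assert not verify(uri, b"tampered data")
```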
We address here the first and lowest of these layers: the nanopublication server network that provides the data. Services on the higher layers can then build on it; for example, a query service could regularly crawl the server network on a given topic, and replace outdated nanopublications in its triple store with new ones.

A query request to this service, however, would not involve an immediate query to the underlying server network, in the same way that a query to the Google search engine does not trigger a new crawl of the web.

While the lowest layer would necessarily be accessible to everybody, some of the services on the higher levels could be private or limited to a small (possibly paying) user group. We have in particular scientific data in mind, but we think that an architecture of this kind could also be used for Semantic Web content in general.

As a concrete proposal of a low-level data provision layer, as explained above, we present here a decentralized nanopublication server network with a REST API to provide and distribute nanopublications.

To ensure the immutability of these nanopublications and to guarantee the reliability of the system, these nanopublications are required to have trusty URI identifiers.

Figure 2. Schematic representation of the decentralized server architecture. Nanopublications that have trusty URI identifiers can be uploaded to a server (or loaded from the local file system by the server administrator), and they are then distributed to the other servers of the network. They can then be retrieved from any of the servers, or from multiple servers simultaneously, even if the original server is not accessible.
The restriction to small and immutable nanopublications has a number of positive consequences for the design of the network: The first benefit is that nanopublications are always small (by definition), which makes it easy to estimate how much time is needed to process an entity (such as validating its hash) and how much space is needed to store it (e.g. as a serialized RDF string in a database). Moreover, it ensures that these processing times remain mostly in the fraction-of-a-second range, guaranteeing that responses are always quick, and that these entities are never too large to be analyzed in memory.

Specifically, a nanopublication server of the current network has the following components:

• A key-value store of its nanopublications (with the artifact code from the trusty URI as the key).

• A long list of all stored nanopublications, in the order they were loaded at the given server. We call this list the server's journal, and it consists of a journal identifier and the sequence of the identifiers of the loaded nanopublications, organized in pages of fixed size (see the sketch after this list).

The server network can be seen as an unstructured peer-to-peer network, where each node can freely decide which other nodes to connect to and which nanopublications to replicate.
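A minimal sketch of these two components in Python (the class and the page size are our own illustration, not the actual implementation):

```python
from dataclasses import dataclass, field

PAGE_SIZE = 1000  # assumed fixed journal page size; the real value may differ

@dataclass
class NanopubServerState:
    """Key-value store of nanopublications plus the journal."""
    journal_id: str
    nanopubs: dict = field(default_factory=dict)  # artifact code -> RDF string
    journal: list = field(default_factory=list)   # artifact codes, in load order

    def store(self, artifact_code: str, rdf: str) -> None:
        # Nanopublications are immutable, so storing is idempotent:
        # the same trusty URI always denotes the same content.
        if artifact_code not in self.nanopubs:
            self.nanopubs[artifact_code] = rdf
            self.journal.append(artifact_code)

    def page(self, number: int) -> list:
        """Journal page `number` (1-based) as a list of artifact codes."""
        start = (number - 1) * PAGE_SIZE
        return self.journal[start:start + PAGE_SIZE]
```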

The URI pattern and the hash pattern of a server define the surface features of the nanopublications that the server replicates. A hash pattern makes a server interested in nanopublications whose hash in the trusty URI starts with one of the specified character sequences (separated by blank spaces). As hashes are represented in Base64 notation, a hash pattern consisting of two two-character sequences (such as "AA AB") would let a server replicate about 0.05% of all nanopublications. Nanopublication servers are thereby given the opportunity to declare which subset of nanopublications they replicate, and need to connect only to those other servers whose subsets overlap. To decide whether a nanopublication belongs to a specified subset or not, the server only has to apply string matching at two given starting points of the nanopublication's trusty URI.
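A sketch of this check (assuming, as above, that the artifact code forms the last 45 characters of the trusty URI and that its first two characters are the module identifier):

```python
def matches(trusty_uri: str, uri_pattern: str, hash_pattern: str) -> bool:
    """Surface-feature matching: does a nanopublication fall into the
    server's declared subset? Empty patterns match everything."""
    hash_part = trusty_uri[-43:]  # artifact code without the 2-char module id
    uri_ok = not uri_pattern or any(
        trusty_uri.startswith(p) for p in uri_pattern.split())
    hash_ok = not hash_pattern or any(
        hash_part.startswith(p) for p in hash_pattern.split())
    return uri_ok and hash_ok

# A hash pattern of two two-character Base64 prefixes covers
# 2 * (1/64)**2 of the hash space:
print(2 * (1 / 64) ** 2)  # 0.00048828125, i.e. about 0.05%
```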

• A journal page can be requested by page number as a list of trusty URIs.

• For every journal page (except for incomplete last pages), a gzipped package can be requested containing the respective nanopublications.

• The list of known peers can be requested as a list of URLs.

In addition, a server can optionally support the following two actions (in the form of HTTP POST requests):

• A server may accept requests to add a given individual nanopublication to its database.

• A server may also accept requests to add the URL of a new nanopublication server to its peer list.

To keep the network connected and the nanopublications flowing, each server s regularly visits every server p on its peer list and performs the following steps:

1. The latest server information is retrieved from p, which includes its list of known peers P_p, the number of stored nanopublications n_p, the journal identifier j_p, the server's URI pattern U_p, and its hash pattern H_p.

2. All entries in P_p that are not yet on the visiting server's own list of known peers P_s are added to P_s.
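The remaining steps of the visit then compare p's journal with what s has already seen, and retrieve and load the new nanopublications that match the patterns of s. A rough sketch of such a visit in Python, with hypothetical endpoint paths and response formats (the actual REST API may differ):

```python
import gzip
import urllib.request

def fetch(base_url: str, path: str) -> bytes:
    # The endpoint paths used here ('peers', 'journal/<n>',
    # 'package/<n>') are illustrative, not the actual API.
    with urllib.request.urlopen(base_url + path) as response:
        return response.read()

def visit_peer(peer_url: str, known_peers: set, have: set,
               hash_prefixes: tuple) -> list:
    """One peer visit: merge peer lists (step 2 above), then collect
    the gzipped packages of all journal pages that contain matching
    nanopublications we do not have yet."""
    known_peers.update(fetch(peer_url, "peers").decode().splitlines())
    packages, page = [], 1
    while uris := fetch(peer_url, f"journal/{page}").decode().splitlines():
        new = [u for u in uris
               if (not hash_prefixes or u[-43:].startswith(hash_prefixes))
               and u[-45:] not in have]  # matching and not stored yet
        if new:
            packages.append(gzip.decompress(fetch(peer_url, f"package/{page}")))
        page += 1
    return packages
```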

To make the infrastructure described above practically useful, we have to introduce the concept of indexes.
One of the core ideas behind nanopublications is that each of them is a tiny atomic piece of data. This implies that analyses will mostly involve more than just one nanopublication and typically a large number of them. Similarly, most processes will generate more than just one nanopublication, possibly thousands or even millions of them. Therefore, we need to be able to group nanopublications and to identify and use large collections of them. Indexes serve this purpose: an index is itself a nanopublication that refers to a set of other nanopublications, and, because indexes have to remain small themselves, larger collections can be defined by indexes that build upon other indexes.

To illustrate how publishing works in practice, consider a researcher who wants to publish a new claim. The actual claim or hypothesis of the nanopublication goes into the assertion graph, which, together with the head, provenance, and publication info graphs, constitutes a very simple but complete nanopublication. Again, after just a few minutes this nanopublication will be distributed in the network and available on the other servers.
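A minimal sketch of this four-graph structure, built here with the Python rdflib library; all example.org names and the toy claim are hypothetical, while the graph wiring follows the nanopublication schema (http://www.nanopub.org/nschema#):

```python
from rdflib import Dataset, Literal, Namespace
from rdflib.namespace import RDF, XSD

NP = Namespace("http://www.nanopub.org/nschema#")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/mynanopub#")  # hypothetical namespace

ds = Dataset()

# Head graph: wires the three content graphs together.
head = ds.graph(EX.Head)
head.add((EX[""], RDF.type, NP.Nanopublication))
head.add((EX[""], NP.hasAssertion, EX.assertion))
head.add((EX[""], NP.hasProvenance, EX.provenance))
head.add((EX[""], NP.hasPublicationInfo, EX.pubinfo))

# Assertion graph: the actual (toy) claim.
ds.graph(EX.assertion).add((EX.malaria, EX.isTransmittedBy, EX.mosquitoes))

# Provenance graph: where the assertion comes from.
ds.graph(EX.provenance).add((EX.assertion, PROV.wasDerivedFrom, EX.someStudy))

# Publication info graph: metadata about the nanopublication itself.
ds.graph(EX.pubinfo).add((EX[""], PROV.generatedAtTime,
    Literal("2016-01-01T00:00:00Z", datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```

Before publication, such a nanopublication is transformed into its trusty variant, which appends the computed artifact code to its URIs and thereby makes it immutable and verifiable.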

The generated index is stored in the file index.cdkn2a-nanopubs.trig, and our exemplary researcher can publish it in the same way as the nanopublications themselves. There is no need to publish the five nanopublications this index refers to, because they are already public (this is how we got them in the first place). The index URI can now be used to refer to this new collection of existing nanopublications in an unambiguous and reliable manner, and it can be included in the scientific publication that explains the new finding. In this case with just five nanopublications, one might as well refer to them individually, but this is obviously not an option for cases where we have hundreds or thousands of them. The given web link allows everybody to retrieve the respective nanopublications via the server np.inn.ac. The URL will not resolve should the server be temporarily or permanently down, but because it is a trusty URI we can retrieve the nanopublications from any other server of the network following a well-defined protocol (basically just extracting the artifact code, i.e. the last 45 characters, and appending it to the URL of another nanopublication server). This reference is therefore much more reliable and more robust than links to other types of data repositories. In fact, this is how we refer to the datasets we use in this publication.

The new finding that was deduced from the given five nanopublications can, of course, also be published as a nanopublication, with a reference to the given index URI in the provenance part. We can again transform it to a trusty nanopublication, and then publish it as above.
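Such a fallback lookup is simple enough to sketch directly (np.inn.ac appears above; the second server URL is a placeholder):

```python
import urllib.request

SERVERS = ["http://np.inn.ac/", "http://server2.example.org/np/"]

def retrieve(trusty_uri: str) -> bytes:
    """Fetch a nanopublication by its trusty URI, trying one server
    of the network after the other."""
    artifact_code = trusty_uri[-45:]  # as described above
    for server in SERVERS:
        try:
            with urllib.request.urlopen(server + artifact_code) as r:
                content = r.read()
            # A complete client would additionally verify the content
            # against the hash encoded in the artifact code.
            return content
        except OSError:
            continue  # server unreachable: try the next one
    raise RuntimeError("no server could deliver " + artifact_code)
```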

Some of the features of the presented command-line interface are also made available through a web interface for dealing with nanopublications, which is shown in Figure 4. The supported features include the generation of trusty URIs, as well as the publication and retrieval of nanopublications. The interface allows us to retrieve, for example, the nanopublication we just generated and published above, even though we used an example.org URI, which is not directly resolvable. Unless it is just about toy examples, we should of course try to use resolvable URIs, but with our decentralized network we can retrieve the data even if the original link is no longer functioning or temporarily broken.

To evaluate our approach, we want to find out whether a small server network run on normal web servers is able to handle the publication and replication of large numbers of nanopublications in a reliable and efficient manner. (The state of the network is continuously monitored with the nanopub-monitor tool: https://github.com/tkuhn/nanopub-monitor. The scripts and data of the evaluations are available at https://bitbucket.org/tkuhn/trustypublishing-study/ and https://bitbucket.org/tkuhn/trustypublishingx-study/.)

In the second part of the evaluation we expose a server to heavy load from clients to test its retrieval capacity. For this we use the Load Impact service (https://loadimpact.com) to let up to 100 clients access a nanopublication server in parallel. We test the server in Zurich over a time of five minutes under the load from a linearly increasing number of parallel clients, and we compare the result to the same test run against a SPARQL endpoint interface. This comparison is admittedly not a fair one, as SPARQL endpoints are much more powerful and are not tailor-made for the retrieval of nanopublications, but they nevertheless provide a valuable and well-established reference point to evaluate the performance of our system.

While the second part of the evaluation focuses on the server perspective, the third part considers the client side. In this last part, we want to test whether the retrieval of an entire dataset in a parallel fashion from the different servers of the network is indeed efficient and reliable. We decided to use a medium-sized dataset and chose LIDDI (NP Index RA7SuQ0e66, 2015), which consists of around 100,000 triples.

In addition, we wanted to test the retrieval in a situation where the internet connection and/or the nanopublication servers are highly unreliable. For that, we implemented a version of an input stream that introduces errors to simulate such unreliable connections or servers. With a given probability (set to 1% for this evaluation), each read attempt on the input stream (a single read attempt typically asking for about 8000 bytes) either leads to a randomly changed byte or to an exception being thrown after a delay of 5 seconds (both having an equal chance of occurring of 0.5%). This behavior is activated by a dedicated option of our client and is obviously only useful for testing purposes.
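The evaluated implementation is written in Java; the following is our own Python rendering of the described error injection, with illustrative names:

```python
import random
import time

class UnreliableStream:
    """Wrap a binary stream; each read() either succeeds, corrupts one
    byte, or fails after a delay, as in the evaluation setup above."""

    def __init__(self, stream, error_probability=0.01, delay=5.0):
        self.stream = stream
        self.error_probability = error_probability  # 1% per read attempt
        self.delay = delay

    def read(self, size=8000):
        data = self.stream.read(size)
        if data and random.random() < self.error_probability:
            # Both failure modes are equally likely (0.5% each).
            if random.random() < 0.5:
                i = random.randrange(len(data))
                corrupted = bytes([random.randrange(256)])
                data = data[:i] + corrupted + data[i + 1:]
            else:
                time.sleep(self.delay)
                raise IOError("simulated connection failure")
        return data
```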
Figure 6. The rate at which nanopublications are loaded at their first, second, and third server, respectively, over the time of the evaluation. At the first server, nanopublications are loaded from the local file system, whereas at the second and third server they are retrieved via the server network.

We look into the amount of time this retrieval operation takes, and the number of times the retrieval of a single nanopublication from a server fails and has to be repeated.

The first part of the evaluation lasted 13 hours and 21 minutes, at which point all nanopublications were replicated on all three servers, and the nanopublication traffic therefore came to an end. Figure 6 shows the rate at which the nanopublications were loaded at their first, second, and third server, respectively.

For the third part of the evaluation, all forty retrieval attempts succeeded. After normalization, the downloaded datasets were all identical, including the ones that were downloaded through an input stream that was artificially made highly unreliable. Figure 9 shows the number of retrieval failures and the download times. Over the normal connection, fewer than 10 download attempts failed in 18 of the 20 test runs. In the remaining two runs, the connection happened to be temporarily unreliable for "natural" reasons, and the number of download failures rose to 181 and 9458, respectively. This, however, had no effect on the timely success of the download. On average over these 20 test runs, the entire dataset was successfully downloaded in 235 seconds, with a maximum of 279 seconds. Unsurprisingly, the artificially unreliable connection leads to a much larger average number of failures and retries, but these failures have no effect on the final downloaded dataset, as we have seen above. On average, 2486 download attempts failed and had to be retried in the unreliable setting. In particular because half of these failures included a delay of 5 seconds, the download times are more than doubled, but still in a very reasonable range, with an average of 517 seconds and a maximum below 10 minutes.

In summary, the first part of the evaluation shows that the overall replication capacity of the current server network is around 9.4 million new nanopublications per day, or 3.4 billion per year. The results of the second part show that the load on a server, measured as response times, is barely noticeable for up to 50 parallel clients, and therefore the network can easily handle 50 · x parallel client connections or more, where x is the number of independent physical servers in the network (currently x = 10). The second part thereby also shows that the restriction of avoiding parallel outgoing connections for the replication between servers is actually a very conservative measure that could be relaxed, if needed, to allow for a higher replication capacity. The third part of the evaluation shows that the client-side retrieval of entire datasets is indeed efficient and reliable, even if the used internet connection or some servers in the network are highly unreliable.

We have presented here a low-level infrastructure for data sharing, which is just one piece of a bigger ecosystem to be established. The implementation of components that rely on this low-level data sharing infrastructure is ongoing and future work. This includes the development of the "core services" outlined above.

Apart from that, we also have to scale up the current small network. As our protocol only allows for simple key-based lookup, the time complexity for all types of requests is sublinear and therefore scales up well. The main limiting factor is disk space, which is relatively cheap and easy to add. Still, the servers will have to specialize even more, i.e. replicate only a part of all nanopublications, in order to handle really large amounts of data.

In addition to the current surface feature definitions via URI and hash patterns, a number of additional ways of specializing are possible in the future: Servers can restrict themselves to particular types of nanopublications, e.g. to specific topics or authors, and communicate this to the network in a similar way as they now do with URI and hash patterns; inspired by the Bitcoin system, certain servers could accept only nanopublications whose hash starts with a given number of zero bits, which makes it costly to publish; and some servers could specialize in new nanopublications, providing fast access but only for a restricted time, while others could take care of archiving old nanopublications, possibly on tape and with considerable delays between request and delivery. Lastly, there could also emerge interesting synergies with novel approaches to internet networking, such as Content-Centric Networking (Jacobson et al., 2012), in which, consistent with our proposal, requests are based on content rather than hosts.

We argue that data publishing and archiving can and should be done in a decentralized manner. We believe that the presented server network can serve as a solid basis for semantic publishing, and possibly also for the Semantic Web in general. It could contribute to improving the availability and reproducibility of scientific results and put a reliable and trustworthy layer underneath the Semantic Web.