A Decentralized Architecture for Sharing and Querying Semantic Data

Although the Semantic Web in principle provides access to a vast Web of interlinked data, the full potential currently remains mostly unexploited. One of the main reasons is the fact that the architecture of the current Web of Data relies on a set of servers providing access to the data. These servers represent bottlenecks and single points of failure that result in instability and unavailability of data at certain points in time. In this paper, we therefore propose a decentralized architecture (Piqnic) for sharing and querying semantic data. By combining both client and server functionality at each participating node and introducing replication, Piqnic avoids bottlenecks and keeps datasets available and queryable even when the original source is (temporarily) not available. Our experimental results using a standard benchmark of real datasets show that Piqnic can serve as an architecture for sharing and querying semantic data, even in the presence of node failures.


Introduction
More and more datasets are being published in RDF format. These datasets cover a broad range of topics, such as geography, cross-domain knowledge, government, life sciences, etc. Access to these datasets is offered in different ways, e.g., they can be downloaded as data dumps, they can be queried via SPARQL endpoints, or they can be "browsed" via dereferencing URIs.
Once published, however, we are often in a situation where the datasets, or rather the interfaces to access them, are not available when needed. In fact, studies found that over half of the public SPARQL endpoints have less than 95% availability [2]. The reason often simply is that maintaining these interfaces requires considerable resources from the data providers. In practice, this means that the data necessary to answer a certain query might not be available at a specific time, so that the answer might be incomplete or, more generally, the same query might have different answers at different points in time.
Hence, despite the great potential of the Semantic Web, accessing RDF datasets today entirely relies on the services offered by the data providers, e.g., web interfaces with downloadable datasets, SPARQL endpoints, or dereferenceable URIs. Especially SPARQL endpoints often require huge amounts of resources for query processing, which further increases the burden on the data providers [9,20]. Despite recent efforts that proposed to implement monetary incentives to solve this problem [8], we argue that we can achieve availability by applying decentralization instead of relying on the availability of single servers and their functionality. This not only better reflects the nature of the World Wide Web but also avoids dependencies and single points of failure.
In this paper, we therefore propose Piqnic (a P2p system for Query processiNg over semantIC data). Piqnic introduces decentralization as a key concept by building on the Peer-to-Peer (P2P) paradigm and replication. Piqnic functions as a P2P network of homogeneous clients that can be queried by any node in the network. By combining both client and server functionality at each peer and introducing replicas, we avoid single points of failure as (sub)queries can be processed by multiple alternative peers and the data is still available even though the original source is not. In doing so, Piqnic offers solutions to two of the main problems that the current Semantic Web is suffering from: high query loads at the data provider's site (SPARQL endpoints) [20] and availability of datasets [2].
In summary, this paper makes the following contributions:
- A P2P-based architecture for publishing and querying RDF data (Piqnic)
- A customizable scheme for replicating and fragmenting datasets
- Query processing strategies in Piqnic networks with replicated and fragmented data
- An extensive evaluation of the proposed approaches
This paper is structured as follows. While Section 2 discusses related work, Section 3 presents the Piqnic framework and its main concepts. Section 4 then describes how to process queries in Piqnic. Section 5 presents the results of our evaluation, and Section 6 concludes the paper with a summary and an outlook to future work.

Related Work
Recent developments in privacy and personal data on the social Web have inspired interesting new applications and use cases. The Solid project [13], for instance, uses a decentralized architecture and Semantic Web technologies to enable personal online datastores (pods) to be stored separately from applications. In fact, users decide themselves where a pod is hosted, giving them control over their data. While the idea of storing Linked Data in multiple locations is central to our work, Solid focuses on privacy protection of personal data whereas we focus on the availability of open datasets.
Federated query processing over SPARQL endpoints is a widely used approach to query over distributed Linked Open Data. To lower the computational load at the servers hosting the SPARQL endpoints, recent proposals, such as Triple Pattern Fragments (TPF) [20] and Bindings-Restricted Triple Pattern Fragments (brTPF) [9], propose to shift part of the load to the client issuing the query. This, in turn, increases the availability of the servers. Nevertheless, TPF/brTPF servers still represent a single point of failure; if the server is not available, the hosted datasets are not available either, which is the problem we are targeting in this paper.
To further share the computational load in a TPF setting, processing SPARQL queries in networks of browsers has been proposed [6,7,16]. The key principle is to share the computational load among a set of clients based on the functionality offered by their browsers and caching of recently used datasets using a collaborative caching system based on overlay networks [5]. However, browsers are relatively unstable nodes with very limited processing power and storage capacity, which naturally limits the general applicability. In contrast, we aim at a relatively stable network with more powerful nodes and split datasets into smaller fragments that are replicated.
Replication of triple pattern fragments has been considered in [15], where fragments are replicated at multiple servers to allow for balancing the server loads and providing fault tolerance. While [15] considers a fixed set of servers that provide access to a fixed set of replicated fragments, and clients that are aware of the allocation of fragments to servers, Piqnic has a fault-tolerant P2P-based architecture where clients also serve replicated fragments to other clients, which naturally allows for handling dynamic behavior of the clients.
P2P systems in general vary in their level of decentralization. Structured P2P systems organize their peers in an overlay network using, for instance, Distributed Hash Tables (DHTs) to decide where to store and find particular data items. Some of these systems were proposed to support RDF data [3,10,11].
The key principle of these systems is that the connections between peers, i.e., the layout of the network, and the data placement are imposed on the participating peers, restricting their autonomy. As a consequence, such systems are vulnerable to situations where many peers leave and join the network, as this might require major reorganizations of structure and data placement in the network.
Unstructured P2P systems, on the other hand, retain a high degree of their peers' autonomy, i.e., there is no globally enforced network layout or data placement. The basic way of processing queries in such networks is flooding, i.e., a request is flooded through the network along the connections between neighboring peers until an answer has been found. These systems are therefore more reliable with respect to dynamic behavior, i.e., clients joining and leaving the network. The prospects of unstructured P2P techniques as a decentralized architecture for Linked Data have also been recognized in recent vision papers [14,17]. While these papers provide interesting insights into the benefits of decentralization and replication, we propose a concrete system and implementation for query processing over a network of unstructured P2P clients.

PIQNIC
Piqnic builds upon basic principles of P2P systems: a client software is running at each participating peer that (i) provides access to a network of clients without central authority and (ii) offers access to locally stored datasets at other clients in the network. To minimize local space consumption at a client, we use HDT [4] files. As is common in P2P networks, clients do not have global knowledge of all the peers in the network and their connections. Hence, Piqnic clients always maintain a partial view of the entire network. This partial view consists of (i) nodes with related data, i.e., data that use common URIs/IRIs, to ensure that queries over multiple datasets can be completed efficiently and (ii) random neighbors to ensure connectivity of the entire network. In the following, we use the terms client and node interchangeably.
Before going into details on query processing (Section 4), this section first introduces the notion of datasets and data fragments (Section 3.1). Afterwards, Section 3.2 outlines Piqnic's network architecture. Last, Section 3.3 describes the dynamic behavior of Piqnic nodes, maintaining a partial view over the network, and data replication.

Data Fragmentation
Since RDF datasets can be quite large (e.g., YAGO3 [12] with over 100 million triples), replicating entire RDF datasets at another node might not always be possible or useful. Hence, inspired by the TPF-style of accessing data, we propose a customizable approach for fragmenting large datasets.
Consider the infinite and disjoint sets U (the set of all URIs/IRIs), B (the set of all blank nodes), L (the set of all literals), and V (the set of all variables).
Definition 1 (Fragment). Let $G_N$ be a knowledge graph that includes all RDF triples in a Piqnic network. A fragment $f$ is a 4-tuple $f = \langle T, N, u, i \rangle$ with the following elements:
• $T$ is a finite set of RDF triples, and $T \subseteq G_N$,
• $N$ is a set of Piqnic nodes containing the fragment,
• $u$ is a URI/IRI that identifies the fragment, and
• $i$ is an identification function that determines whether the fragment contains triples matching a given triple pattern.
Identification functions are mainly used during query processing to determine whether or not a triple pattern should be evaluated over a fragment. Following the principle that a data provider uploads a dataset using a local Piqnic client, we say that datasets are "owned" by a specific node. The owner node manages the allocation of replicas to other nodes in the network. For more details on this, please see Section 3.3. We then define a dataset as a set of fragments:
Definition 2 (Dataset). A dataset $D$ is a triple $D = \langle F, u, o \rangle$ with the following elements:
• $F$ is a set of fragments,
• $u$ is a URI/IRI that identifies the dataset, and
• $o$ is an identifier of the "owner" node, i.e., the node that uploads $F$ to the network.

A fragmentation function F is then defined as follows.
Definition 3 (Fragmentation function). A fragmentation function $F$ is a function that, when applied to a knowledge graph $G$, creates a set of fragments.
Concrete fragmentation functions can result in different levels of granularity. For example, a fragmentation function $F_C$ with $F_C(G) = \{G\}$ is very coarse-grained, as it does not split up the original knowledge graph $G$. On the other hand, a fragmentation function $F_F$ with $F_F(G) = \{\{t\} \mid t \in G\}$ creates a separate fragment for each individual triple and is therefore very fine-grained. Piqnic uses a predicate-based fragmentation function $F_P$ (Definition 4), which creates one fragment for each unique predicate in $G$.
Naturally, more complex fragmentation functions can be defined. However, in its current implementation Piqnic uses $F_P$ because it has a straightforward implementation and is guaranteed to generate pairwise disjoint fragments as each triple has exactly one predicate, i.e., for any two distinct fragments $f_1, f_2 \in F_P(G)$, it holds that $f_1 \cap f_2 = \emptyset$.
Example 1 (Fragmentation). Consider the example knowledge graph $G_E$ in Table 1a. Applying $F_P$ to $G_E$ results in the set of fragments $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ shown in Table 1b: one fragment for each unique predicate $p_1$, $p_2$, $p_3$, $p_4$, and $p_5$.
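A predicate-based fragmentation of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the Piqnic implementation: triples are modeled as plain tuples, the function name `fragment_by_predicate` is hypothetical, and the six triples are illustrative rather than the content of Table 1a.

```python
from collections import defaultdict


def fragment_by_predicate(graph):
    """Group (subject, predicate, object) triples into one fragment per predicate."""
    fragments = defaultdict(set)
    for s, p, o in graph:
        fragments[p].add((s, p, o))
    return dict(fragments)


# Illustrative triples only; the identifiers are placeholders.
graph = {
    ("a", "p1", "b"), ("b", "p1", "d"), ("d", "p1", "c"),
    ("a", "p2", "c"), ("b", "p2", "e"), ("c", "p2", "a"),
}
fragments = fragment_by_predicate(graph)
```

Because every triple has exactly one predicate, the resulting fragments are pairwise disjoint and their union is the original graph.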

Network Architecture
A Piqnic network consists of a set of interconnected nodes, each maintaining a local data/triple store to manage a set of fragments. A node is defined as follows.
Definition 5 (Node). A node $n$ is a triple $n = \langle \Gamma, \Delta, N \rangle$ where
• $\Gamma$ is the set of fragments located on the node,
• $\Delta$ is a set of datasets owned by the node, and
• $N$ is a set of so-called neighbor nodes in the network.
Each node $n$ maintains a set $n.N$ of neighbor nodes representing a partial view over the network. In order to ensure that (i) related data is close in the network to increase the completeness of query answers, and (ii) all data and nodes can be reached (connectivity of the network), $n.N$ contains nodes with related fragments as well as random nodes in the network.
To account for changes in the network, Piqnic uses periodic shuffles [21] between pairs of nodes. A node $n$ selects a random node $n'$ in $n.N$, to which it sends a subset of its neighbors, removing them from its own partial view. This subset consists of the least related neighbors based on the "joinability" of the nodes' fragments.
Definition 6 (Fragment Joinability). Let $s_t$ and $o_t$ be the subject and object of triple $t$, and $G_N$ the knowledge graph containing all RDF triples in a network. Two fragments are joinable, denoted by the binary relation $\perp\!\!\!\perp$, if a triple $t_1$ in one fragment has a subject or object in common with a triple $t_2$ in the other.
We observe that the binary relation $\perp\!\!\!\perp$ is symmetric and reflexive. It is symmetric since if $t_1$ has a subject or object in common with $t_2$, $t_2$ has the same subject or object in common with $t_1$. It is reflexive since any triple $t$ has its own subject and object in common with itself.
Fragment joinability only considers if two fragments are joinable, and does not consider the rate of overlap between them. This is to avoid favoring large fragments where the absolute number of joint subjects and objects is likely to be higher than for small fragments because of the higher number of triples. The relative number of overlapping subjects and objects is not a good alternative either as fragments with a small overlap might still be important to achieve complete query results.
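The joinability test can be illustrated with a short Python sketch. It assumes fragments are sets of (subject, predicate, object) tuples; the helper names `terms` and `joinable` are hypothetical and not part of Piqnic.

```python
def terms(fragment):
    """Set of all subjects and objects occurring in a fragment."""
    return {s for s, _, _ in fragment} | {o for _, _, o in fragment}


def joinable(f1, f2):
    """True if some triple in f1 shares a subject or object with some triple in f2."""
    return not terms(f1).isdisjoint(terms(f2))


# Illustrative fragments with placeholder identifiers.
f1 = {("a", "p1", "b")}
f2 = {("b", "p2", "e")}
f3 = {("x", "p3", "y")}
```

Here `joinable(f1, f2)` holds because both fragments mention `b`, whereas `f1` and `f3` share no subject or object. The check is Boolean on purpose: it ignores how many terms overlap, in line with the argument above.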
Based on Definition 6, we can now define a relatedness metric to rank a node's neighbors. We consider only non-identical joinable fragments. Hence, given a node n the goal is to select the k least related nodes R, where R ⊆ n.N s.t. we minimize the objective function in Equation 1.

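$$\mathrm{Rel}(n) = \operatorname*{arg\,min}_{R \subseteq n.N,\ |R| = k} \sum_{n_i \in R} |\mathrm{Join}(n, n_i)| \qquad (1)$$
where $\mathrm{Join}(n, n_i)$, as defined in Equation 2, is the set of fragments in $n$ that are joinable with one of node $n_i$'s fragments that does not have the same fragment identifier:
$$\mathrm{Join}(n, n_i) = \{ f_1 \in n.\Gamma \mid \exists f_2 \in n_i.\Gamma : f_1 \perp\!\!\!\perp f_2 \wedge f_1.u \neq f_2.u \} \qquad (2)$$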
To compute relatedness in a running system, the nodes exchange the sets of objects and subjects in a compressed representation, such as bitvectors, which can be stored locally for future use.
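The selection of the k least related neighbors can be sketched as follows. The sketch assumes that relatedness to a neighbor is measured by the number of fragments in Join(n, n_i), and it models each node's fragments as a dictionary from fragment identifier to a set of triples; the names `join` and `least_related` and the example data are hypothetical.

```python
import heapq


def terms(fragment):
    """Set of all subjects and objects occurring in a fragment."""
    return {s for s, _, _ in fragment} | {o for _, _, o in fragment}


def join(local, remote):
    """Ids of local fragments joinable with a remote fragment that has a different id."""
    return {
        fid
        for fid, triples in local.items()
        if any(
            fid != rid and not terms(triples).isdisjoint(terms(rtriples))
            for rid, rtriples in remote.items()
        )
    }


def least_related(local, neighbors, k):
    """Names of the k neighbors with the fewest joinable fragments."""
    return heapq.nsmallest(k, neighbors, key=lambda name: len(join(local, neighbors[name])))


local = {"f1": {("a", "p1", "b")}, "f2": {("a", "p2", "c")}}
neighbors = {
    "n1": {"f3": {("b", "p3", "x")}, "f4": {("c", "p4", "y")}},
    "n2": {"f5": {("q", "p5", "r")}},
    "n3": {"f1": {("a", "p1", "b")}},
}
candidates = least_related(local, neighbors, 2)
```

In this example `n2` shares no terms with the local fragments and `n3` only contributes a replica of `f1` plus one joinable pair, so these two would be handed over in a shuffle while the more related `n1` is kept.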

Replication of Datasets
Any node participating in a Piqnic network can upload a dataset and become its owner node. When uploading a knowledge graph G, a fragmentation function (F P (G)) is applied to obtain a set of fragments. This set of fragments is then used to create a dataset D.
Allocation of fragments in a Piqnic network follows a chaining approach, i.e., the owner node passes the fragment on to one of its neighbors, which inserts the fragment into its own local data store and forwards the fragment to one of its neighbors. This continues for a certain number of steps, referred to as replication factor ($r_f$). If a node cannot insert a fragment (for instance, because of too little available storage space), it returns one of its neighbors to the previous node. Lastly, the set of nodes at which the fragment has been inserted is returned to the owner node.
Consider the fragments in Table 1b. Suppose $c_2$ wants to allocate $f_3$ with $r_f = 1$ and selects neighbor $c_1$. $f_3$ is then forwarded to $c_1$ with $r_f = r_f - 1 = 0$. $f_3$ is therefore inserted into $c_1$'s local data store, resulting in Figure 2b. $\{c_1\}$ is then returned to $c_2$ as the set of nodes in which $f_3$ has been inserted.
[Figure 2: local data stores of the nodes; panel (b) shows the state after $f_3$ has been inserted at $c_1$. Fragments shown include $f_1$: ⟨a, p1, b⟩, ⟨b, p1, d⟩, ⟨d, p1, c⟩ and $f_2$: ⟨a, p2, c⟩, ⟨b, p2, e⟩, ⟨c, p2, a⟩.]
If a node containing a fragment from $D.F$ fails, the owner will allocate the fragment to another node, ensuring the continued availability of the fragment. If the owner itself fails, another node can take over the task of maintaining availability.
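The chaining approach can be simulated with a small Python sketch. It is a simplification under stated assumptions: storage limits are modeled as a maximum number of fragments per node, a node always tries its first unvisited neighbor, and a node that cannot store the fragment simply lets the chain continue through its own neighbors. The class `Node` and the function `allocate` are hypothetical names.

```python
class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # max number of fragments (hypothetical limit)
        self.store = {}           # fragment id -> set of triples
        self.neighbors = []

    def can_insert(self, fid):
        return fid not in self.store and len(self.store) < self.capacity


def allocate(owner, fid, triples, rf):
    """Pass a fragment along a chain of neighbors until rf replicas are placed."""
    placed, visited, current = [], {owner.name}, owner
    while rf > 0:
        candidates = [n for n in current.neighbors if n.name not in visited]
        if not candidates:
            break  # the chain cannot be extended any further
        nxt = candidates[0]  # the neighbor choice is arbitrary in this sketch
        visited.add(nxt.name)
        if nxt.can_insert(fid):
            nxt.store[fid] = triples
            placed.append(nxt.name)
            rf -= 1
        # Whether or not nxt stored the fragment, the chain continues via nxt's neighbors.
        current = nxt
    return placed  # reported back to the owner


def build():
    """Small chain c2 -> c1 -> c3 used for the two scenarios below."""
    c1, c2, c3 = Node("c1", 2), Node("c2", 2), Node("c3", 2)
    c2.neighbors, c1.neighbors = [c1], [c3]
    return c1, c2, c3


c1, c2, c3 = build()
result = allocate(c2, "f3", {("a", "p3", "d")}, rf=1)

d1, d2, d3 = build()
d1.capacity = 0  # d1 has no free space, so the chain moves on to d3
fallback = allocate(d2, "f3", {("a", "p3", "d")}, rf=1)
```

The first call mirrors the example above: the fragment ends up at `c1` and the set containing only `c1` is returned. In the second scenario the first neighbor has no space, so the replica is placed one hop further along the chain.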
Besides making sure fragments are always available, Piqnic exposes the following operations, which the owner of a dataset $D$ can execute: (i) add triples to fragments in $D$, (ii) remove triples from fragments in $D$, (iii) allocate fragments to further nodes, and (iv) revoke an allocation of a fragment from a node. An update is executed locally on the owner node, which then forwards the updated fragment to the nodes it is allocated to.
Joining a Piqnic network can be achieved by knowing an arbitrary node and making it a neighbor. Consider, for instance, a node $n_1$ that wants to join the network via node $n_2$: $n_1$ sends a message to $n_2$, which replies with a subset of its neighbors. $n_1$ will then take over some replicas of these neighbors and gradually become a full member of the network.

Query Processing
Any node in a Piqnic network can issue queries. Query processing follows the basic principle of flooding that is employed in P2P systems [1], i.e., a query is forwarded to a peer's neighbors, which in turn forward it to their neighbors until a certain Time-To-Live (TTL) value/distance is reached. In Piqnic, a SPARQL query $q$ at a node $n_i$ is processed in the following steps:
1. Estimate the cardinality of each triple pattern in $q$ using variable counting. The order in which triple patterns are processed is determined by this estimation, i.e., most selective triple patterns are evaluated first.
2. Evaluate $q$'s triple patterns, starting from $n_i$'s local datastore, over the data accessible via $n_i$'s neighbors by flooding the network using a specified TTL value.
3. Receive partial results from the queried nodes in the network (only nodes with results reply).
4. Compute the final query result by combining the intermediate results of the triple patterns and the remaining operations necessary to complete $q$.
We use fragment identifiers to avoid querying the same fragment twice on different nodes. Moreover, if a fragment is available locally, we use that and do not query it again on another node.
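Two ingredients of these steps, ordering triple patterns by variable counting and flooding with a TTL, can be sketched in Python. The sketch makes simplifying assumptions: variable counting is read as "fewer variables means more selective", variables are strings starting with `?`, and flooding is simulated on a static adjacency list with duplicate suppression. All function names and the example network are hypothetical.

```python
from collections import deque


def is_var(term):
    return term.startswith("?")


def order_by_variable_count(patterns):
    """Order triple patterns so that those with the fewest variables come first."""
    return sorted(patterns, key=lambda tp: sum(is_var(t) for t in tp))


def flood(start, neighbors, ttl):
    """Nodes reached from start when a message is forwarded for at most ttl hops."""
    seen, queue = {start}, deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            continue  # the TTL is exhausted, so the message is not forwarded
        for nb in neighbors.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, remaining - 1))
    return seen


patterns = [("?s", "?p", "?o"), ("?s", "p1", "b"), ("?s", "p2", "?o")]
ordered = order_by_variable_count(patterns)

network = {"n1": ["n2"], "n2": ["n3"], "n3": ["n4"], "n4": []}
reached = flood("n1", network, ttl=2)
```

With a TTL of 2, the query issued at `n1` reaches `n2` and `n3` but not `n4`, which illustrates why a low TTL can lead to incomplete answers.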
Obviously, step 2 can be implemented in different ways. But before going into details on this aspect, let us first define an identification function ($i$ in Definition 1) to decide whether a fragment is relevant for a particular triple pattern or not. As fragmentation is defined on predicates, we use a predicate-based identification function.
Definition 7 (Predicate-based identification function). Let $F_P(G_N)$ be the set of fragments in a network, $f \in F_P(G_N)$ be a fragment, $tp$ be a triple pattern, and $p_{tp}$ the predicate of $tp$. A predicate-based identification function $F_{IP}(f, tp)$ returns true iff $\forall t \in f : p_t = p_{tp}$ or $p_{tp}$ is a variable.
We implemented and evaluated three query processing strategies that differ in step 2 of the above description: Single, Bulk, and Full.
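The predicate-based identification function of Definition 7 translates almost literally into Python. The sketch below assumes triples and triple patterns are 3-tuples of strings and that variables start with `?`; the name `identifies` is hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def identifies(fragment, tp):
    """True iff the pattern's predicate is a variable or equals every predicate in the fragment."""
    p_tp = tp[1]
    return is_var(p_tp) or all(p == p_tp for _, p, _ in fragment)


# Illustrative fragment in which every triple has predicate p1.
f1 = {("a", "p1", "b"), ("b", "p1", "d")}
```

A node can use this check to skip fragments that cannot contribute to a triple pattern without inspecting individual triples beyond their predicate.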

Single Strategy
The Single approach is inspired by query processing in TPF [20] and Jena ARQ. The triple patterns of a query $q$ are processed sequentially. To process the current triple pattern, we use the intermediate results from the node's local fragments (step 1) and previously computed triple patterns. We instantiate the triple pattern with an already known solution mapping and send it to the neighbors. This is done for each known solution mapping separately, hence the name of this strategy: Single.
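The instantiation step of the Single strategy can be sketched as follows, assuming solution mappings are dictionaries from variable names to values and triple patterns are 3-tuples; `instantiate` and `single_requests` are hypothetical names.

```python
def instantiate(tp, mapping):
    """Replace every variable of the pattern that is bound in the mapping."""
    return tuple(mapping.get(term, term) for term in tp)


def single_requests(tp, mappings):
    """One instantiated triple pattern (i.e., one request) per known solution mapping."""
    return [instantiate(tp, m) for m in mappings]


tp = ("?x", "p2", "?y")
mappings = [{"?x": "a"}, {"?x": "b"}]
requests = single_requests(tp, mappings)
```

Each known mapping produces its own request, so the number of messages grows linearly with the number of intermediate solution mappings.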

Bulk Strategy
Obviously, the Single strategy can be improved by sending sets of solution mappings along with a triple pattern throughout the network instead of individual solution mappings. This is a similar optimization as proposed in [9] to improve TPF query processing. Hence, using the Bulk strategy we expect that considerably fewer messages are sent throughout the network. Ideally, all bindings are sent along in a single message. However, for some triple patterns there is a high number of intermediate solution mappings, e.g., query L1 in our evaluation has more than 232,000 solution mappings for the variable ?results.
Propagating such a large number of bindings through the network might easily become a problem because of the message size. Hence, in such cases, we send the bindings in groups of up to $s_m$ bindings. In our current implementation, we use a default value of $s_m = 1,000$ (empirically determined based on the data and queries used in our experiments).
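Grouping bindings into messages of bounded size can be sketched in a few lines of Python. The generator name `batches` is hypothetical, and bindings are represented here by plain integers purely for illustration.

```python
def batches(bindings, s_m=1000):
    """Yield consecutive groups of at most s_m bindings."""
    for i in range(0, len(bindings), s_m):
        yield bindings[i:i + s_m]


sizes = [len(group) for group in batches(list(range(2500)))]
messages_for_232k = sum(1 for _ in batches(list(range(232_000))))
```

For exactly 232,000 bindings this grouping yields 232 messages per triple pattern, compared with 232,000 individual requests under the Single strategy.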

Full Strategy
In contrast to the other strategies, the Full strategy does not include the results of already computed solution mappings in the queries sent throughout the network. Instead, it forwards the triple patterns as defined in the original input query to the neighbors and exploits the fact that this can be done in parallel (instead of sequentially as in the other strategies). However, as this strategy cannot exploit the selectivity of triple patterns instantiated with solution bindings, more data has to be sent throughout the network. Likewise, more data has to be processed locally at the querying node to compute the final result.
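The local work that the Full strategy leaves to the querying node is essentially a join over the solution mappings returned for each triple pattern. The following sketch uses a nested-loop join for clarity, which is an assumption for illustration and not necessarily the join algorithm used in the prototype; `compatible` and `join` are hypothetical names.

```python
def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on all shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left, right):
    """Merge every compatible pair of solution mappings."""
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]


# Results of two triple patterns evaluated independently of each other.
left = [{"?x": "a", "?y": "b"}, {"?x": "b", "?y": "d"}]
right = [{"?y": "b", "?z": "e"}, {"?y": "c", "?z": "a"}]
joined = join(left, right)
```

Since neither pattern was restricted by the other's bindings, all four mappings are transferred although only one combination survives the join.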

Evaluation
We implemented a prototype Piqnic client in Java 8 using the HDT Java library for the local datastore and extended Apache Jena to support the three query processing approaches discussed in Section 4.

Experimental Setup
We ran our experiments on a server with four 16-core AMD Opteron 6376 processors at 2.3GHz, with 768KB L1 cache, 16MB L2 cache, and 16MB L3 cache each (64 cores in total), and 516GB RAM. To evaluate our approach we used LargeRDFBench [18], which extends FedBench [19] with additional datasets and queries. LargeRDFBench comes with 13 datasets with altogether over 1 billion triples and was designed to evaluate federated SPARQL query processing engines. LargeRDFBench provides a total of 40 queries divided into four distinct sets: simple (S), complex (C), large data (L), and complex and large data (CH). However, to enable a more fine-granular analysis, we distinguish the two subsets of S that were originally defined in FedBench but merged together in LargeRDFBench: cross domain (CD) and life sciences (LS).
In our experiments, we varied a broad range of parameters. However, due to space restrictions we do not show all experimental results in this paper but focus on a subset. More evaluation results are available on our website. For each experiment, we measured the following metrics:
• Query Execution Time (QET) is the amount of time it takes to answer a query, i.e., the time elapsed between issuing the query and obtaining the final answer.
• Completeness (COM) measures how complete the computed set of answers for a query is, expressed as the percentage of computed answers in comparison to the complete set of answers.
• Number of Transferred Bytes (NTB) is the total number of bytes transferred between nodes during query execution.
• Number of Messages (NM) is the total number of messages exchanged between nodes during query execution.
Each experiment was run as follows: all queries in the query load were executed 3 times at randomly chosen clients in the network; the reported measurements represent the averages of the 3 executions.
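The completeness metric and the averaging over runs can be made concrete with a short Python sketch. It treats answers as sets (duplicates are ignored) and reports 100% when the complete answer set is empty; both choices are assumptions made for this illustration, as are the function names and the example values.

```python
def completeness(computed, complete):
    """Percentage of the complete answer set that is contained in the computed answers."""
    complete = set(complete)
    if not complete:
        return 100.0  # assumed convention for queries without answers
    return 100.0 * len(set(computed) & complete) / len(complete)


def average(values):
    """Arithmetic mean, e.g., over the 3 executions of a query."""
    return sum(values) / len(values)


com = completeness({"r1", "r2", "r3"}, {"r1", "r2", "r3", "r4"})
qet = average([2.0, 4.0, 3.0])  # hypothetical execution times in seconds
```

In this example three of four expected answers were found, giving a completeness of 75%, and the reported execution time is the mean of the three runs.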

Performance of query execution strategies
Since Full timed out for all queries in groups L and CH and all but a few queries in C, and Single timed out for most queries in the aforementioned groups, in this set of experiments we focus our discussion on query groups CD and LS; the omitted results can be found on our website. The corresponding query execution times (QET) are shown in Figure 3.
As we can clearly see, Bulk generally performs much better than the other two approaches with respect to execution time. It is not surprising that Bulk generally outperforms Single, as sending groups of bindings instead of sending each binding separately considerably reduces communication and computational overhead. While Full performs quite poorly in most cases, it is faster than both Bulk and Single in rare cases, e.g., LS7. This is due to the fact that some triple patterns have a very low selectivity. In such cases, almost all triples in a fragment are relevant and it is therefore more efficient to download the entire fragment instead of exchanging multiple rounds of messages with large amounts of data. This is evident from Figure 4, which shows the number of messages sent through the network. Not surprisingly, in all cases Single sends more messages throughout the network than the two other approaches. While Full is, as expected, better in this regard than Bulk, the two are still quite similar in most cases. This is due to the cardinalities of most triple patterns being lower than 1,000, so that only one message is sent. The queries that did not time out all delivered complete query results (100% completeness).
Figure 5 shows the number of transferred bytes (NTB) for queries in groups CD and LS. Again, we leave out queries that timed out. Not surprisingly, Full transfers the largest amount of data in all cases. For most queries, Single and Bulk are comparable and the differences negligible.

Robustness of the network
We have also evaluated how Piqnic networks perform in the presence of node failures. To test robustness and availability of data, we focused on the Bulk strategy and the query sets CD and LS. Figure 6 shows the average completeness of queries with a varying number of failing nodes. In the experiment, we executed all the queries and noted the completeness, i.e., we gradually killed a randomly selected number of nodes until no nodes were left in the network. The network was given no recovery time after nodes had been killed. The results show the robustness of Piqnic against node failures: completeness started at 100% and stayed above 90% until fewer than 60% of the nodes were running. Afterwards, the completeness gradually decreased. When giving the network recovery time between each run, i.e., allowing all nodes to perform 3 shuffles before the next set of nodes is killed, Piqnic is able to keep the completeness close to 100% (the lowest was 94.44% and was due to a single query not being answered) even when 50% of the nodes failed. However, we should mention that query execution time was affected since each node has more fragments to look through. This shows that Piqnic is able to keep data available through replication at the cost of increased execution time, as it is more expensive to find the relevant fragment.

Impact of Time-To-Live
Intuitively, a higher TTL value gives access to a larger part of the network. However, a large TTL value also means sending messages to more nodes. In fact, since we use a flooding technique, the number of messages sent increases exponentially with the TTL value. To systematically analyze the impact of the TTL value, we compare COM and QET for three different TTL values: 3, 5, and 10. Figure 7 shows average completeness and execution time for each of the 5 groups of queries in our query load using the Bulk strategy. Even though many of the queries in groups L and CH timed out, they still provided some results before timing out. In general, a TTL value of 3 results in incomplete results for all query groups. We observe that even though a TTL value of 10 gives in total more complete results, the additional query execution time indicates that this might not necessarily be a good tradeoff. Instead, a TTL value of 5 shows almost as complete results with lower execution times.
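The exponential growth of messages with the TTL can be illustrated with a back-of-the-envelope calculation. The sketch below computes an upper bound under the assumption that every node forwards a message to a fixed number of neighbors and that no duplicates are suppressed; the fan-out of 5 is a hypothetical value, not a parameter reported for the experiments.

```python
def max_messages(fanout, ttl):
    """Upper bound on messages sent when each node forwards to `fanout` neighbors for `ttl` hops."""
    return sum(fanout ** hop for hop in range(1, ttl + 1))


bounds = {ttl: max_messages(5, ttl) for ttl in (3, 5, 10)}
```

Under these assumptions the bound grows from 155 messages for a TTL of 3 to 3,905 for a TTL of 5 and more than 12 million for a TTL of 10, which is consistent with the observation that the highest TTL value is not necessarily a good tradeoff.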

Conclusions
In this paper, we proposed Piqnic (a P2p system for Query processiNg over semantIC data) to process queries over semantic datasets. Piqnic is inspired by recent advances in decentralized Semantic Web systems as well as P2P systems in general, and provides a client that, in addition to providing query access to vast amounts of data, functions as a server maintaining a local datastore. We presented a general architecture for sharing and processing RDF data in a decentralized manner and customizable approaches for data fragmentation and query processing over a network of clients. Our experiments show that the Bulk strategy provides the best performance on average and that Piqnic is able to tolerate node failures. As highlighted by one of our experiments, it is not straightforward to find a good balance between completeness, TTL, and query execution time. We will therefore investigate this problem in our future work.