Materialisation and Data Partitioning Algorithms for Distributed RDF Systems

Many RDF systems support reasoning with Datalog rules via materialisation, where all conclusions that follow from the RDF data and the rules are precomputed and explicitly stored in a preprocessing step. As the amount of RDF data used in applications keeps increasing, processing large datasets often requires distributing the data in a cluster of shared-nothing servers. While numerous distributed query answering techniques are known, distributed materialisation is less well understood. In this paper, we present several techniques that facilitate scalable materialisation in distributed RDF systems. First, we present a new distributed materialisation algorithm that aims to minimise communication and synchronisation in the cluster. Second, we present two new algorithms for partitioning RDF data, both of which aim to produce tightly connected partitions, but without loading complete datasets into memory. We evaluate our materialisation algorithm against two state-of-the-art distributed Datalog systems and show that our technique offers competitive performance, particularly when the rules are complex. Moreover, we analyse in depth the effects of data partitioning on reasoning performance and show that our techniques offer performance comparable or superior to state-of-the-art min-cut partitioning, but computing the partitions requires considerably less time and memory.


Introduction
The Resource Description Framework (RDF) is a popular data format that allows a domain of interest to be represented in terms of entities called resources, and labelled relationships between resources called triples. An RDF dataset can be seen as a directed graph in which triples correspond to edges between resources. While answering queries over an RDF dataset is the focus of most RDF applications, reasoning capabilities of RDF systems have been growing in importance. RDF reasoning systems take as input an RDF dataset and a formal description of an application domain, which is often captured using a prominent rule-based formalism called Datalog [3]. A Datalog rule expresses an 'if-then' condition specifying how to derive one or more triples from structural patterns in an RDF dataset. When answering queries, RDF reasoning systems take into account not only the explicitly given triples, but also triples that logically follow from a given set of Datalog rules. The computational properties and the expressivity of Datalog are well understood, which has contributed to wide adoption of Datalog in practice. For example, reasoning in the OWL 2 RL profile of the Web Ontology Language (OWL) can be supported either by translating an OWL ontology into rules [18], or by using the fixed rule set from the OWL 2 RL specification [39]. Furthermore, application logic is sometimes captured directly in Datalog rules [44,35,38]. Thus, developing efficient algorithms for Datalog reasoning over RDF datasets is an active research topic. Datalog reasoning is often supported by materialisation: all triples that logically follow from a dataset and a set of rules are precomputed and stored in a preprocessing step, so that queries can be evaluated without referring to the rules. Materialisation is typically realised using the seminaïve algorithm [3], which ensures the nonrepetition property: no rule is applied to the same triples more than once. 
This property was shown to be essential in practice even for moderately sized datasets.
The size of RDF datasets used in applications has been increasing continuously. For example, the UniProt dataset contains over 34 billion triples; moreover, many applications combine several large datasets. This poses significant challenges to RDF systems that centralise processing on a single computer. The answer is often to partition the data in a cluster of shared-nothing servers, but this introduces considerable complexity: related triples may reside on different servers, so network communication may be needed. In the context of distributed RDF querying, numerous solutions have been presented and incorporated into systems such as YARS2 [24], 4store [23], H-RDF-3X [26], Trinity.RDF [59], SHARD [46], SHAPE [33], Partout [14], AdPart [5], TriAD [21], SemStore [57], DREAM [22], and WARP [25]. Abdelaziz et al. [2] surveyed 22 and evaluated 11 such systems on a variety of data and query loads, showing AdPart [5] and TriAD [21] to be the best performing.
Distributed reasoners face several problems that are not found in distributed query answering systems: freshly derived triples must participate in all relevant inferences, which can interact with mechanisms for distributing and storing derived triples; moreover, it is essential for the nonrepetition property to be preserved. These issues have been addressed in practice in several different ways. Certain systems handle only fixed Datalog rules: systems by Kaoudi et al. [30] and Weaver and Hendler [55] handle RDFS rules; WebPIE [54] and Cichlid [19] support the so-called ter Horst fragment [53]; and SPOWL [34] supports OWL 2 RL rules. While tailoring the reasoning algorithms to specific rules simplifies issues such as nonrepetition of derivations, such solutions are limited in their generality. PLogSPARK [58] can handle arbitrary rules, but it does not seem to use seminaïve evaluation. BigDatalog [50] and Cog [28] implement the seminaïve algorithm, but they seem to be able to process only a few linear rules at a time. Distributed SociaLite [48] implements the seminaïve algorithm for arbitrary Datalog rules. The standard techniques for implementing the seminaïve algorithm require maintaining and copying several auxiliary relations, which can be inefficient in a distributed system. Thus, the tradeoffs in developing algorithms for distributed Datalog reasoning do not yet seem to be fully understood.
Another problem in distributed RDF systems is to partition the data in a way that facilitates efficient distributed computation: intuitively, tightly connected clusters of resources should be placed on a single server in order to reduce communication during both rule matching and fact derivation. Very little attention has been devoted to this problem so far. Most distributed RDF systems use a variant of either subject hashing, where the placement of a triple is determined by hashing the triple's subject, or min-cut partitioning [31], where resources are partitioned to minimise the number of triples spanning two partitions. The former technique is simple to implement, but it does not produce tightly connected partitions; in contrast, min-cut partitioning tends to produce tight partitions, but it requires considerable time and memory and may be infeasible on large datasets. Thus, the question of how to partition the data in distributed RDF reasoning systems is still largely open.
In this paper, we present several novel techniques that provide the foundation for scalable distributed RDF reasoning systems. Our contribution is two-fold.
First, we present a new algorithm for distributed materialisation of Datalog rules over RDF datasets. We build on the work by Potter et al. [45] on distributed query answering using dynamic data exchange, from which we inherit several important properties. First, inferences that can be made within a single server are made without any communication; coupled with careful data partitioning, this can significantly reduce network communication overheads. Second, rule evaluation is completely asynchronous, which promotes parallelism. This, however, introduces a complication: to ensure nonrepetition of inferences, we must be able to partially order rule derivations across the cluster. We address this problem using Lamport timestamps [32], which allows us to support seminaïve evaluation without expensive maintenance of auxiliary relations. Moreover, dynamic data exchange requires careful maintenance of certain indexes as new facts are derived, which introduces considerable technical difficulties due to asynchronous processing. We present our materialisation algorithm in Section 4.

Second, we consider the problem of partitioning RDF data. We draw our inspiration from the extensive literature on streaming graph partitioning algorithms that can process large graphs 'on the fly'. Specifically, such algorithms read a suitable encoding of a graph sequentially (possibly more than once), but their memory use is determined by the number of vertices, rather than the number of edges in the graph. A recent survey [42] identified the HDRF [43] algorithm for streaming partitioning of undirected graphs as particularly suitable for graphs with power-law degree distribution. The more recently proposed 2PS [37] algorithm seems to be able to outperform HDRF in some cases. RDF datasets often contain at least an order of magnitude more triples than resources, so one can expect streaming approaches to be particularly suitable to very large RDF datasets.
Thus, in Section 5 we present two new algorithms for streaming partitioning of RDF data that adapt the HDRF and 2PS algorithms to the specifics of RDF. Since subject-subject joins are the most common in RDF queries [15], a key challenge is to ensure that our modified algorithms always place all triples with the same subject on one server.
We have implemented our reasoning and partitioning algorithms in a prototype system called DMAT. In Section 6, we present the results of several experiments that we used to evaluate our techniques. First, we analysed how different data partitioning strategies affect the performance of reasoning. Second, to explore the limits of our approach, we investigated how reasoning performance scales with increasing data loads. Third, to evaluate our reasoning approach against the state of the art, we compared the performance of materialisation in DMAT with that of BigDatalog [50] and Cog [28]. Our results show that our data partitioning algorithms are generally very effective in reducing communication during reasoning, and that this often leads to shorter reasoning times. Moreover, DMAT could handle increasing data loads well, and it outperformed the competition on all benchmarks. Thus, our techniques seem to provide a sound foundation for the development of massively scalable distributed RDF reasoners.

Preliminaries
To make this paper self-contained, we now recapitulate the definitions that we use in the rest of this paper.

Syntax.
A constant is an IRI, a blank node, or a literal. A term is a constant or a variable. An atom is an expression of the form ⟨t_s, t_p, t_o⟩ over terms t_s (subject), t_p (predicate), and t_o (object). Let Π = {s, p, o} be the set of positions. Then, given an atom A = ⟨t_s, t_p, t_o⟩ and a position π ∈ Π, A|_π is the term that occurs in atom A at position π; that is, A|_π = t_π. A fact is a variable-free atom. Whenever no explicit qualification is given, we use lowercase letters from the end of the alphabet (x, y, z, …) for variables, lowercase letters from the beginning of the alphabet (a, b, c, …) for subject and object constants, and uppercase letters from the middle of the alphabet (R, S, T, …) for predicate constants.
An (RDF) dataset D is a finite set of facts. The vocabulary of D is the set of all constants occurring in D. For c a constant, let D⁺(c) = {⟨s, p, o⟩ ∈ D | s = c} and D(c) = {⟨s, p, o⟩ ∈ D | s = c or o = c}. Then, |D⁺(c)| and |D(c)| are the out-degree and the degree of c, respectively. A query is a conjunction of atoms of the form A₁ ∧ ⋯ ∧ Aₙ (1), where n ≥ 1 and all Aᵢ are atoms. A Datalog rule is an implication of the form h ← b₁ ∧ ⋯ ∧ bₙ (2), where h is the head atom, all bᵢ are body atoms, n ≥ 1, and each variable occurring in h also occurs in some bᵢ. A Datalog program is a finite set of rules.
Note that this definition allows for arbitrarily shaped rules over RDF data. In particular, it includes, but is not limited to, the OWL 2 RL/RDF rules [39], or any subset of these rules such as, for example, the ter Horst fragment [53].
In RDF literature, constants are often called RDF terms, atoms are called triple patterns, facts are called triples, and datasets are called RDF graphs. In this paper, however, we adopt the terminology commonly used in the literature on Datalog reasoning.

Substitutions.
A substitution σ is a partial function that maps finitely many variables to constants. For α a term or an atom, ασ is the result of replacing with σ(x) each occurrence of a variable x in α on which σ is defined.
Semantics. Let D be a dataset. For Q a query of the form A₁ ∧ ⋯ ∧ Aₙ, a substitution σ is an answer to Q on D if σ is defined precisely on all variables occurring in Q, and Aᵢσ ∈ D holds for each 1 ≤ i ≤ n. The result of applying a rule r of the form h ← b₁ ∧ ⋯ ∧ bₙ to D is the set of facts r(D) = {hσ | σ is an answer to b₁ ∧ ⋯ ∧ bₙ on D}. For P a program, let P(D) = ⋃_{r ∈ P} r(D); let P⁰(D) = D; and let Pⁱ⁺¹(D) = Pⁱ(D) ∪ P(Pⁱ(D)) for i ≥ 0. Then, the closure of P on D is defined as P^∞(D) = ⋃_{i ≥ 0} Pⁱ(D). When the closure is computed in advance and persisted, we refer to both P^∞(D) and the process of computing it as materialisation.
Seminaïve Evaluation. We can compute P^∞(D) using the definition just given: we evaluate the body of each rule r ∈ P as a query over D and instantiate the head of r for each query answer, we eliminate duplicate facts, and we repeat the process until no new facts are derived. However, Pⁱ(D) ⊆ Pⁱ⁺¹(D) holds for each i ≥ 0, so this naïve approach rederives in each round of rule application all facts from all previous rounds. The seminaïve strategy avoids this problem: when matching a rule in round i + 1, at least one body atom of the rule must be matched to a fact derived in round i. This is critical in practice even for very simple rules.
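For illustration, the following Python sketch (our own, not taken from any system discussed in this paper) applies the seminaïve strategy to the transitive closure rule ⟨x, P, z⟩ ← ⟨x, P, y⟩ ∧ ⟨y, P, z⟩:

```python
def seminaive_transitive_closure(triples, pred):
    """Round-based seminaive evaluation of <x,P,z> <- <x,P,y> & <y,P,z>:
    in every round, at least one body atom is matched to a fact derived
    in the previous round (the 'delta' set), so no rule application is
    repeated; a separate set-difference step eliminates duplicates."""
    current = set(triples)
    delta = set(triples)               # facts derived in the previous round
    while delta:
        new = set()
        by_subject = {}                # index 'current' facts by subject
        for (s, p, o) in current:
            if p == pred:
                by_subject.setdefault(s, set()).add(o)
        # first body atom matched to 'delta', second to 'current'
        for (s, p, o) in delta:
            if p == pred:
                for z in by_subject.get(o, ()):
                    new.add((s, pred, z))
        # first body atom matched to 'old' facts only, second to 'delta'
        for (s, p, o) in current - delta:
            if p == pred:
                for (s2, p2, o2) in delta:
                    if p2 == pred and s2 == o:
                        new.add((s, pred, o2))
        delta = new - current          # duplicate elimination
        current |= delta
    return current
```

Each round performs only the joins involving at least one fresh fact, which is exactly the nonrepetition guarantee described above.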

Dataset Partitions.
A partition 𝒫 of an RDF dataset D is a list of RDF datasets 𝒫 = D₁, …, Dₙ such that Dᵢ ∩ Dⱼ = ∅ for 1 ≤ i < j ≤ n and D = ⋃ᵢ₌₁ⁿ Dᵢ. We call the datasets Dᵢ partition elements. The objective of distributed reasoning is to compute P^∞(D) using a partition where each partition element is stored in a distinct server in a shared-nothing cluster. For convenience, we identify each server in the cluster by an integer between 1 and n.

Partitioning Problem.
A key question in distributed RDF processing is to compute a partition of an RDF dataset that facilitates efficient query processing and reasoning. In the graph partitioning literature, the vertex replication factor is often used as a measure of partition 'tightness' [43,42]. This notion can be adapted to RDF as follows: for 𝒫 = D₁, …, Dₙ a partition of an RDF dataset D, the replication set of a constant c is defined as R(c) = {i | Dᵢ ∩ D(c) ≠ ∅}, and the replication factor of a partition 𝒫 is defined as RF(𝒫) = (1 / |V|) · Σ_{c ∈ V} |R(c)|, where V is the vocabulary of D. In other words, the replication factor is the average number of servers that constants occur on. The term 'replication' is sometimes used in the RDF literature to denote the idea of storing the same fact on more than one server, but this is not the intended meaning in the literature on graph partitioning or in our work; in fact, the partition elements in this paper are all pairwise disjoint. Given a fixed tolerance parameter η ≥ 1, the objective of graph partitioning is to compute a partition 𝒫 of an RDF dataset D such that |Dᵢ| ≤ η · |D| / n holds for each 1 ≤ i ≤ n, while minimising the replication factor RF(𝒫). In other words, each Dᵢ should hold roughly the same number of facts, while ensuring that constants are replicated as little as possible. Solving this problem exactly is computationally hard, so the objective is usually weakened in practice. The algorithms we present in this paper honour the restrictions on the sizes of Dᵢ, and they aim to make the replication factor small, but without firm minimality guarantees.
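The replication factor can be computed with a few lines of Python (an illustrative sketch of the definition above; for simplicity, the vocabulary is restricted to constants occurring in subject or object position):

```python
from collections import defaultdict

def replication_factor(partition):
    """partition: a list of pairwise-disjoint sets of (s, p, o) triples.
    Returns the average number of partition elements on which each
    constant occurs, i.e. the mean size of the replication sets."""
    repl = defaultdict(set)            # constant -> replication set
    for i, element in enumerate(partition):
        for (s, p, o) in element:
            repl[s].add(i)
            repl[o].add(i)
    return sum(len(rs) for rs in repl.values()) / len(repl)
```

For the two-element partition [{⟨a, R, b⟩}, {⟨b, R, c⟩}], constant b occurs on both servers while a and c occur on one each, giving a replication factor of 4/3.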

Related Work
Before presenting our contribution, we first discuss the relevant related work. In Section 3.1, we discuss the existing distributed Datalog reasoning systems and approaches. In Section 3.2, we survey the work by Potter et al. [45] on distributed query answering with dynamic data exchange, which provides the starting point for our distributed reasoning algorithm. In Section 3.3, we discuss the data partitioning approaches typically used in distributed RDF systems.

Approaches to Distributed Reasoning
A number of approaches were developed in the 1990s for materialising Datalog programs when the data is distributed across several processors. These approaches are not specific to RDF, but we shall discuss them in our setting. The key idea is to partition rule applications to processors. For example, to evaluate the rule ⟨x, R, z⟩ ← ⟨x, R, y⟩ ∧ ⟨y, R, z⟩ on n processors, we let each processor k with 1 ≤ k ≤ n evaluate the rule

⟨x, R, z⟩ ← ⟨x, R, y⟩ ∧ ⟨y, R, z⟩ ∧ h(y) = k,   (4)

where h is a partition function that maps the values of y to integers between 1 and n. If h is uniform and constants are uniformly distributed across triples, then each processor receives roughly the same fraction of the workload. Practical experience suggests that such an approach can often parallelise computation very effectively. However, since a fact of the form ⟨a, R, b⟩ can match either atom in the body of rule (4), each such fact must be replicated to processors h(a) and h(b) to ensure completeness. Based on this idea, Ganguly et al. [16] show how to handle general Datalog; Zhang et al. [61] study different partition functions; Seib and Lausen [47] identify programs and partition functions where no replication of derived facts is needed; Shao et al. [49] further break rules into segments; and Wolfson and Ozeri [56] replicate all facts to all processors in order to increase parallelism. The primary motivation behind these approaches seems to be parallelisation of computation, which explains why the high rates of data replication were deemed acceptable. Materialisation can also be implemented without any data replication. First, one must select a data partitioning strategy: a common approach is to assign each triple ⟨s, p, o⟩ to server h(s) using a suitable hash function h, and another popular option is to use a distributed file system (e.g., HDFS) and thus leverage its partitioning mechanism.
Second, the rule bodies are evaluated using a distributed query evaluation algorithm, the newly derived facts are distributed according to the partitioning strategy, and the process is repeated iteratively as long as fresh facts are derived.
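The hash-partitioned evaluation of rule (4) can be sketched in Python as follows (an illustrative rendering of the idea, not taken from any of the cited systems):

```python
def processor_workload(facts, n, k):
    """Rule applications of <x,R,z> <- <x,R,y> & <y,R,z> assigned to
    processor k via the constraint h(y) = k. A fact <a,R,b> is
    replicated to processors h(a) and h(b), since it can match either
    body atom; processor k then joins only on values of y it owns."""
    h = lambda c: hash(c) % n
    local = {(s, p, o) for (s, p, o) in facts if h(s) == k or h(o) == k}
    derived = set()
    for (x, p1, y) in local:
        if h(y) != k:                  # join value y not owned by processor k
            continue
        for (y2, p2, z) in local:
            if y2 == y and p2 == p1:   # join the two body atoms on y
                derived.add((x, p1, z))
    return derived
```

Taking the union of `processor_workload` over all k yields exactly one round of applications of the original rule, with the workload split by the join key.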
These principles were applied to reasoning over RDF datasets. Most systems in this category handle only fixed rule sets (i.e., the rules are hardcoded in the algorithms and cannot be changed). The systems by Weaver and Hendler [55] and Kaoudi et al. [30] support RDFS reasoning. Other systems borrow mechanisms for distributed data storage and query evaluation from big data frameworks such as Hadoop and Spark. In particular, WebPIE [54] supports the ter Horst fragment [53] in Hadoop; Cichlid [19] also supports the ter Horst fragment, but in Spark; and SPOWL [34] supports extensions of the OWL 2 RL rules in Spark. Handling only fixed rule sets considerably simplifies the design of reasoning algorithms. For example, seminaïve evaluation is not needed for RDFS reasoning since nonrepetition of inferences can be ensured by evaluating the rules in a particular order. Greater generality is offered by PLogSPARK [58], which handles general Datalog rules over RDF data in Spark. However, this system seems to use naïve rule evaluation, which can be prohibitive when the rules are complex.
Distributed Datalog reasoning has also been studied in the database community. BigDatalog [50] and Cog [28] were implemented on top of Spark and Flink, respectively, but they target a slightly different setting. Both systems accept as input a Datalog program and a query. They then select the rules relevant to the query, compute their materialisation, and evaluate the given query over the materialisation. However, the materialisation is not persisted: it is treated as a temporary object used just to answer the query. Both systems claim to support arbitrary Datalog and seminaïve evaluation, but, as we report in Section 6, we managed to use them with only a handful of linear rules. Yedalog [10] is Google's private implementation of Datalog on top of a proprietary data model. Distributed SociaLite [48] implements distributed seminaïve materialisation for general Datalog, but users must explicitly specify the data distribution strategy and communication patterns. For example, by writing a fact R(a, b) as R[a](b), one specifies that the fact should be stored on server h(a) for some hash function h. Rule (4) can then be written in SociaLite as R[x](z) ← R[x](y) ∧ R[y](z), specifying that the rule should be evaluated by sending each fact R[x](y) to server h(y), joining such facts there with the facts R[y](z), and sending the resulting facts R[x](z) to server h(x). While the evaluation of some of these rules can be parallelised, all servers in a cluster must synchronise after each round of rule application. Yedalog and SociaLite also implement extensions of Datalog, such as monotonic aggregation and built-in predicates.

Dynamic Data Exchange
Distributed query answering is a key ingredient of distributed reasoning algorithms. Most distributed query evaluation algorithms can be understood as using a variant of the data exchange operators first introduced in the Volcano system [17]. Such operators send and receive partial query answers during query evaluation, and are introduced into query plans to ensure that all partial answers that may participate in a join are transmitted to one server.
Recently, Potter et al. [45] presented a distributed query evaluation algorithm where data exchange is dynamic: communication is guided by the data encountered during query evaluation, rather than being fixed at query compilation time. The objectives of dynamic data exchange are to reduce communication and eliminate synchronisation between servers. To this end, each server i maintains three indexes called occurrence mappings. For each constant c occurring in Dᵢ, occurrence mapping μ_{i,s}(c) contains all servers where c occurs in the subject position, and occurrence mappings μ_{i,p}(c) and μ_{i,o}(c) provide analogous information for the predicate and object positions. Occurrence mappings thus allow a server to locate all facts containing a particular constant during query answering; this is reminiscent of the decentralised indexes in the peer-to-peer RDF systems by Aebeloe et al. [4].
The occurrence mappings are initialised as shown in Table 1.
In particular, the occurrence mappings on each server must cover all constants occurring on that server. Consider a cluster of two servers storing the partition elements D₁ = {⟨a, R, b⟩, ⟨b, R, c⟩} and D₂ = {⟨b, R, d⟩, ⟨d, R, e⟩}, over which we evaluate the query ⟨x, R, y⟩ ∧ ⟨y, R, z⟩; Table 1 shows example occurrence mappings on servers 1 and 2 for this partition. Mappings μ_{1,s}, μ_{1,p}, and μ_{1,o} are defined on constants a, b, c, and R, whereas μ_{2,s}, μ_{2,p}, and μ_{2,o} are defined on constants b, d, e, and R. Query processing is started by sending the query to the two servers, each storing a partition element. Each server independently evaluates the query over its partition element using index nested loop joins. Thus, server 1 evaluates atom ⟨x, R, y⟩ over D₁, which produces a partial answer σ₁ = {x ↦ a, y ↦ b}. Server 1 then evaluates ⟨y, R, z⟩σ₁ = ⟨b, R, z⟩ over D₁ and thus obtains one full answer σ₂ = {x ↦ a, y ↦ b, z ↦ c}. To see whether ⟨b, R, z⟩ can be matched on other servers, server 1 consults its occurrence mappings for all constants in the atom. Since μ_{1,s}(b) = μ_{1,p}(R) = {1, 2}, server 1 sends the partial answer σ₁ to server 2, telling it to continue matching the query from the second atom. After receiving σ₁, server 2 matches atom ⟨b, R, z⟩ in D₂ to obtain another full answer σ₃ = {x ↦ a, y ↦ b, z ↦ d}. Server 2 also evaluates ⟨x, R, y⟩ over D₂ and obtains the partial answer σ₄ = {x ↦ b, y ↦ d}, and it consults its occurrence mappings to determine which servers can match ⟨y, R, z⟩σ₄ = ⟨d, R, z⟩. Since μ_{2,s}(d) = {2}, server 2 knows it is the only server that can match this atom, so it proceeds without any communication and computes the full answer σ₅ = {x ↦ b, y ↦ d, z ↦ e}.

This strategy has several important benefits. First, all answers that can be produced within a single server, such as σ₅ in our example, are produced without any communication. Second, the location of every constant is explicitly recorded, rather than computed using a fixed rule such as a hash function. An RDF dataset can thus be partitioned based on its structural properties, and highly interconnected constants can all be placed on one server with the aim of reducing network communication. Third, the system is fully asynchronous: when server 1 sends σ₁ to server 2, server 1 does not need to wait for server 2 to finish, and server 2 can process σ₁ whenever it can. The lack of synchronisation between servers is beneficial to parallelisation.
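The following Python sketch (hypothetical names, not the actual API of any system) shows how occurrence mappings can be initialised for a small cluster:

```python
from collections import defaultdict

class Server:
    """A server holding one partition element and its occurrence
    mappings: occ[pos][c] is the set of servers where constant c
    occurs at position pos ('s', 'p', or 'o')."""
    def __init__(self, sid, facts):
        self.sid = sid
        self.facts = set(facts)
        self.occ = {pos: defaultdict(set) for pos in ("s", "p", "o")}

def build_occurrences(servers):
    """Initialise occurrence mappings: each server must know, for every
    constant it stores, all servers where that constant occurs in each
    position."""
    global_occ = {pos: defaultdict(set) for pos in ("s", "p", "o")}
    for srv in servers:
        for (s, p, o) in srv.facts:
            global_occ["s"][s].add(srv.sid)
            global_occ["p"][p].add(srv.sid)
            global_occ["o"][o].add(srv.sid)
    for srv in servers:
        constants = {c for fact in srv.facts for c in fact}
        for pos in ("s", "p", "o"):
            for c in constants:
                srv.occ[pos][c] = global_occ[pos][c]
    return servers
```

With these mappings, a server that instantiates an atom such as ⟨b, R, z⟩ forwards the partial answer only to the servers in `occ["s"]["b"]`, rather than broadcasting it.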

Data Partitioning
Reducing I/O cost via judicious data partitioning has a long tradition in RDF systems; for example, Aluç et al. [7] proposed a self-tuneable data partitioning scheme that aims to optimise on-disk placement of data. However, while it is intuitive to expect that partitioning the data carefully to minimise communication would improve the performance of distributed systems, the effects of data partitioning remain poorly understood. Janke et al. [29] studied this problem in the context of distributed query processing. Interestingly, they concluded that reducing communication can be detrimental if done at the expense of uneven server workload. However, it is unclear to what extent their conclusions apply to distributed reasoning. Reasoning over large datasets involves evaluating millions of queries and distributing derived facts, both of which can incur much more communication than in the case of a single query. Moreover, workload imbalances in individual queries could even themselves out when many queries are evaluated.
Existing approaches to data partitioning can be broadly divided into three groups. The first group consists of systems that store their data in a distributed file system. Data is usually allocated randomly to servers, which makes exploiting data locality during reasoning difficult. The second group consists of hash-based variants, where the location of a fact is determined by hashing one or more of the fact's components (usually the subject). The third group consists of variants based on min-cut graph partitioning, which aims to reduce communication by minimising the number of edges between partitions. Since subject-subject joins were shown to be very common [15], most systems in the latter two groups colocate triples with the same subjects.
Distributed RDF systems sometimes use data replication, where some or all facts are stored on more than one server. The decision about which facts to replicate is typically made during data loading. A prominent approach is k-hop replication, where facts are replicated so that all path queries of length k or less can be processed without any communication [27]. Other approaches to replicating data during loading have been considered too [21,33]. A more recent approach aims to analyse the query load and replicate commonly used triples on the fly [5]. While data replication can be effective at reducing communication in a distributed RDF system, it can also significantly increase the storage requirements; for example, it was shown that undirected two-hop replication can increase the storage requirements by a factor of 4.2 [27]. In reasoning systems, data replication can increase the communication needed to store newly derived facts, and it may lead to redundant derivations. Since our main objective is to devise a distributed reasoning algorithm that does not repeat derivations, we do not consider data replication in our work. In fact, we are unaware of any related approach to distributed reasoning that uses data replication.

Distributed Materialisation Algorithm
We now present our distributed materialisation algorithm. Before presenting the technical details, in Section 4.1 we discuss our main technical challenges. Then, in Section 4.2 we discuss data structures used to store facts; in Section 4.3 we introduce the occurrence mappings; in Section 4.4 we discuss the communication infrastructure; in Section 4.5 we discuss how to detect termination; in Section 4.6 we present the algorithm's pseudocode and argue about its correctness; and in Section 4.7 we illustrate various aspects of our algorithm using a simple example.

Technical Challenges
As mentioned in Section 2, seminaïve evaluation is critical to ensuring practicability of materialisation. This algorithm evaluates rule bodies as queries, so it may be tempting to try to adapt seminaïve evaluation to a distributed setting by switching to a distributed query answering algorithm. We discuss the problems with such a straightforward approach by means of the example rule (8).

⟨x, R, z⟩ ← ⟨x, R, y⟩ ∧ ⟨y, R, z⟩   (8)
The objective of seminaïve evaluation is to avoid repeating derivations, where a derivation refers to matching the body of a rule to a particular set of facts. For example, matching rule (8) to facts ⟨a, R, b⟩ and ⟨b, R, c⟩, and to facts ⟨a, R, d⟩ and ⟨d, R, c⟩, constitutes two distinct derivations. Seminaïve evaluation ensures that each of these derivations is made just once, but it does not prevent both derivations from producing ⟨a, R, c⟩: a separate duplicate elimination step is required to deal with that issue. To this end, rules are applied in rounds, but in each rule application at least one body atom is matched to a fact derived in the previous round. A common way to ensure this is to maintain four sets of facts: the set C of 'current' facts, the set O of 'old' facts derived before the last round, the 'delta' set Δ of facts that are present in the 'current' but not the 'old' set, and the set N of 'new' facts derived in each round of rule application. Rule (8) is then rewritten to use these sets as follows.

N⟨x, R, z⟩ ← Δ⟨x, R, y⟩ ∧ C⟨y, R, z⟩   (9)
N⟨x, R, z⟩ ← O⟨x, R, y⟩ ∧ Δ⟨y, R, z⟩   (10)
Thus, rule (9) produces all derivations of rule (8) where atom ⟨x, R, y⟩ in the latter rule is matched to a fact derived in the previous round of rule application. Rule (10) handles the second body atom of rule (8) analogously; by matching the first body atom of rule (10) only to the 'old' facts, we ensure no repetition of inferences between the two rules. Rules (9) and (10) can be evaluated using any distributed query evaluation strategy. Each round of rule application is followed by a round of updates: the 'delta' facts are added to the 'old' facts, and the difference between the 'new' and the 'current' facts is copied to the 'delta' facts and added to the 'current' facts. This can be inefficient for at least two reasons. First, the update round requires shuffling data between datasets, which can be costly; this can be particularly acute in systems built on top of Hadoop and Spark, where data storage and distribution are managed automatically by a distributed file system. Second, rounds must be synchronised (i.e., updates cannot begin before rule application finishes and vice versa), which can lead to workload skew. We address these problems by drawing inspiration from the work by Motik et al. [40] on parallelising Datalog materialisation in centralised, shared-memory systems. Instead of applying rules in rounds, this approach considers each fact in the dataset, identifies each rule and body atom that can be matched to the fact, and evaluates the rest of the rule as a query. Moreover, all facts are ordered totally based on the sequence of their derivation; this order can be recovered easily from how facts are stored. To prevent inference repetition, each query is evaluated only against the facts derived before the fact being processed.
Such an approach does not proceed in rounds and does not need an explicit update step; rather, rules are applied asynchronously, which is beneficial for parallelisation: the number of facts is generally very large, so the materialisation process decomposes naturally into a large number of completely independent steps that can be evaluated in parallel. The approach has been successfully applied to very large datasets [41].
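The fact-at-a-time strategy can be sketched in Python as follows (an illustrative, centralised rendering for the transitivity rule (8); names are ours). Facts are kept in a list that reflects derivation order, and when a fact F is processed, the partner body atom is matched only against facts that do not come after F in that order:

```python
def fact_at_a_time_closure(facts, pred):
    """Closure of <x,P,z> <- <x,P,y> & <y,P,z> by considering each fact
    once, in derivation order. Matching the partner atom only against
    facts derived no later than the current one ensures that every pair
    of facts is joined exactly once (nonrepetition of derivations)."""
    ordered = list(facts)              # total derivation order
    seen = set(ordered)
    i = 0
    while i < len(ordered):
        s, p, o = ordered[i]
        if p == pred:
            for j in range(i + 1):     # only facts derived up to this one
                s2, p2, o2 = ordered[j]
                if p2 != pred:
                    continue
                if o == s2:            # current fact matches the first atom
                    t = (s, pred, o2)
                    if t not in seen:
                        seen.add(t)
                        ordered.append(t)
                if j < i and o2 == s:  # current fact matches the second atom;
                    t = (s2, pred, o)  # partner strictly earlier: no repeats
                    if t not in seen:
                        seen.add(t)
                        ordered.append(t)
        i += 1
    return seen
```

Each index i corresponds to one independent unit of work, which is what makes the approach easy to parallelise.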
Our distributed materialisation algorithm is based on the same general principle: each server in a cluster matches the rules to locally stored facts, but the resulting queries are evaluated using dynamic data exchange. Our approach thus requires no synchronisation between servers, and communication is reduced in the same way as described in Section 3.2. We thus expect the same benefits as for the query answering algorithm by Potter et al. [45]. However, the lack of synchronisation between servers introduces a complication: since there is no notion of a rule application round, it is not obvious how to guarantee nonrepetition of inferences. A straightforward solution might be to associate each fact with a timestamp recording when the fact was derived so that the order of fact derivation can be recovered; however, this would require maintaining a high coherence of server clocks in the cluster, which is generally impractical. In Section 4.2, we discuss how we solve this problem using Lamport timestamps [32], a well-known and simple way of determining a partial order of events across a cluster.
Another complication arises due to the observation that occurrence mappings may need updating when new facts are derived. The occurrence mappings of all relevant servers must be updated before any rule is applied to such newly derived facts; otherwise, such facts might be skipped during query evaluation, which would jeopardise completeness. Our solution to this problem is again fully asynchronous.
Finally, since no central coordinator keeps track of the state of the computation of different servers, detecting when the system as a whole can terminate is not straightforward. We solve this problem using a well-known termination detection algorithm based on token passing [11].

Lamport Timestamps
To prevent repetition of derivations, we need to arrange all derived facts into a sequence that reflects the derivation order. As mentioned already, maintaining a precise global clock in a distributed system is very challenging, so we instead use Lamport timestamps [32], a general and simple technique for ordering events in a distributed system. In particular, each event is annotated with an integer timestamp in a way that guarantees the following property: ( * ) if there is any way for an event e1 to possibly influence an event e2, then the timestamp of e1 is strictly smaller than the timestamp of e2.
To achieve this, each server keeps an integer counter called the local clock (even though this counter does not measure time), which is incremented whenever an event of interest occurs; this clearly ensures property ( * ) for events e1 and e2 occurring in the same server. Now assume that events e1 and e2 occur in servers k1 and k2, respectively; clearly, e1 can influence e2 only if server k1 sends a message to server k2, and server k2 processes this message before event e2 takes place. To ensure property ( * ), server k1 includes its current clock value into the message it sends to server k2; moreover, when processing this message, server k2 updates its local clock to the maximum of the message clock and the local clock, and then increments the local clock. Thus, when event e2 happens after receiving the message, the timestamp of event e2 is guaranteed to be larger than the timestamp of event e1.
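As a concrete illustration, the clock-update rules above can be sketched in a few lines. The class and method names here are illustrative, not part of any particular implementation:

```python
class LamportClock:
    """Minimal sketch of a Lamport clock for one server."""

    def __init__(self):
        self.clock = 0  # local logical clock; does not measure real time

    def tick(self):
        """Called when a local event of interest occurs; returns its timestamp."""
        self.clock += 1
        return self.clock

    def stamp_message(self):
        """Include the current clock value in an outgoing message."""
        return self.clock

    def receive(self, message_clock):
        """On receipt, take the maximum of the message clock and the local
        clock and increment, so any event after this call gets a strictly
        larger timestamp than the sender's events."""
        self.clock = max(self.clock, message_clock) + 1
        return self.clock
```

For instance, if server 1 performs an event, sends a message, and server 2 then performs an event after processing it, the second event's timestamp is guaranteed to be strictly larger than the first's.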
To apply this idea to Datalog materialisation, a derivation of a fact corresponds to the notion of an event, and using a fact to derive another fact corresponds to the 'influences' notion. Thus, we associate facts with integer timestamps.
More precisely, each server k in the cluster maintains an integer Ck called the local clock, a dataset Ik of facts stored in the server, and a timestamp function Tk : Ik → ℕ that associates each fact in the server with a natural number. Before materialisation, Ck is initialised to zero, and all input facts (i.e., the facts given by the user) assigned to server k are loaded into Ik and assigned a zero timestamp. To capture formally how timestamps are used during query evaluation, we introduce the notion of an annotated query as a conjunction of the form B1⋈1 ∧ ⋯ ∧ Bn⋈n, where each Bi⋈i is an annotated atom consisting of an atom Bi and a symbol ⋈i that can be < or ≤. Answers to an annotated query are computed with respect to a timestamp. More precisely, a substitution σ is an answer to the annotated query on a dataset I and timestamp function T w.r.t. an integer timestamp τ if • σ is an answer to the 'ordinary' query B1 ∧ ⋯ ∧ Bn on dataset I as usual, and • T(Biσ) ⋈i τ holds for each 1 ≤ i ≤ n. For example, let I, T, and Q be as below, and let τ = 2 be a timestamp.
is not an answer to on and w.r.t. due to (⟨ , , ⟩) > 2.
We assume that each server can evaluate one annotated atom-that is, server can call EVALUATE( ⋈ , , , , ), where ⋈ is an annotated atom, is a timestamp, is the dataset stored in the server, provides timestamps for the facts in , and is a substitution. The call returns each substitution defined over the variables in and such that ⊆ , ∈ , is defined on , and ( ) ⋈ holds. In other words, EVALUATE matches ⋈ on , , and , and it returns each match that extends and satisfies ⋈ and . For efficiency, server should index the facts in . Any RDF indexing scheme can be used, and index lookup can be modified to skip facts whose timestamps do not match .
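To make the role of the timestamp annotations concrete, the following sketch shows one possible (naive, unindexed) way such an EVALUATE step could work. All names and the triple representation are assumptions of this sketch; a real implementation would use RDF indexes whose lookups skip facts with non-matching timestamps, rather than a linear scan:

```python
import operator

def evaluate(atom, cmp, tau, dataset, timestamp, sigma):
    """Yield each substitution extending `sigma` that matches `atom` (a
    triple whose variables start with '?') to a fact f in `dataset` such
    that timestamp[f] <cmp> tau, where cmp is '<' or '<='."""
    op = operator.lt if cmp == '<' else operator.le
    for fact in dataset:
        if not op(timestamp[fact], tau):
            continue  # skip facts whose timestamps violate the annotation
        rho = dict(sigma)
        ok = True
        for term, value in zip(atom, fact):
            if term.startswith('?'):  # variable: bind or check existing binding
                if rho.get(term, value) != value:
                    ok = False
                    break
                rho[term] = value
            elif term != value:       # constant: must match exactly
                ok = False
                break
        if ok:
            yield rho
```

For example, with facts timestamped 0 and 3, evaluating an atom annotated with ≤ 2 returns only matches against the first fact.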
Finally, we assume that each server implements a function MATCHRULES( , ) that takes as input a fact and a Datalog program . For each rule ℎ ← 1 ∧ ⋯ ∧ in , each body atom with 1 ≤ ≤ , and each substitution over the variables of such that = , the function returns a 4-tuple containing the substitution, the index of the matched atom, the annotated query (12) formed from the remaining body atoms, and the head atom ℎ. Intuitively, MATCHRULES identifies each rule and each pivot body atom that can be matched to the fact via the substitution. This match will then be extended to all body atoms of the rule by recursively matching the remaining atoms using function EVALUATE. The annotations in (12) specify how to match the remaining atoms without repetition: facts matched before the pivot must have timestamps strictly smaller than the timestamp of the fact being processed, and facts matched after the pivot must have timestamps smaller than or equal to that timestamp. The atoms of query (12) may need to be reordered to obtain an efficient query plan. This can be achieved using any known query planning technique, and further discussion of this issue is out of scope of this paper.
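The following sketch illustrates how MATCHRULES and the annotation scheme just described could look. The representation of rules (head, list of body atoms) and atoms (triples whose variables start with '?') is an assumption of this sketch:

```python
def unify(atom, fact):
    """Match a single atom against a fact; return the substitution, or None."""
    sigma = {}
    for term, value in zip(atom, fact):
        if term.startswith('?'):  # variable
            if sigma.get(term, value) != value:
                return None
            sigma[term] = value
        elif term != value:       # constant must match exactly
            return None
    return sigma

def match_rules(fact, program):
    """For a fact, find each rule and pivot body atom the fact matches.
    The remaining atoms form an annotated query: atoms BEFORE the pivot
    carry '<' (strictly earlier facts only) and atoms AFTER it carry '<='
    (earlier-or-equal facts), which prevents repeated derivations."""
    results = []
    for head, body in program:
        for i, atom in enumerate(body):      # candidate pivot atom
            sigma = unify(atom, fact)
            if sigma is None:
                continue
            query = ([(a, '<') for a in body[:i]] +
                     [(a, '<=') for a in body[i + 1:]])
            results.append((sigma, i, query, head))
    return results
```

For a transitive-closure-style rule with two body atoms, a single fact yields two pivot matches, one per body atom, with complementary annotations on the remaining atom.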

Occurrence Mappings
As in the query answering approach based on dynamic data exchange [45], each server must store indexes , , , , and , called occurrence mappings. Each of these maps constants to (possibly empty) sets of integers identifying servers. Intuitively, , ( ) can be used on server to identify the servers on which constant occurs as the subject of a fact. However, requiring each server to track the location of all constants would mean that the servers' memory use is determined by the size of the entire RDF dataset, rather than the size of the partition element stored in the servers. To remedy this, we allow each server to track locations of certain constants; the exact conditions on which constants must be tracked are formalised in Definition 4.1. Roughly speaking, we require each server to keep track only of constants occurring in the partition element stored in the server, so the size of each occurrence mapping is bounded by the size of the corresponding partition element. Occurrence mappings satisfy the subject constraint if, for each constant and each server , set , ( ) contains at most one element, and ∈ , ( ) implies that constant occurs in subject position of a fact in .
Occurrence mappings are correct for and if they are consistent with and and satisfy the subject constraint.
To simplify the presentation of our algorithm, we assume that each , is defined on all constants. In practice, an implementation can choose to explicitly store only the values for the relevant constants, and use the empty set as the default value for all other constants. Moreover, occurrence mappings of one server are shared and possibly updated by multiple threads of execution during materialisation. We assume that mappings are accessed atomically: whenever , ( ) is used in our algorithm, the result is the value of mapping , on at a point in time.
We illustrate Definition 4.1 by an example involving the following partition elements 1 and 2 and rule .
Constant is relevant to server 1 because it occurs in 1 (at any position). Consistency ensures that server 1 has complete information about the whereabouts of the constant, that is, 1 ∈ 1, ( ) and 2 ∈ 1, ( ) must hold. Intuitively, since constant is relevant to server 1, there is a chance that the body of some rule can be matched to a fact containing , which can instantiate a variable in the rest of the rule body. In our example, atom ⟨ , , ⟩ in rule can be matched to ⟨ , , ⟩, which instantiates the second atom to ⟨ , , ⟩. To continue matching the rule body, 2 ∈ 1, ( ) is needed to instruct server 1 to forward the partial match to server 2 so that the latter server can potentially complete the match. Thus, consistency ensures completeness of rule body evaluation by placing a lower bound on occurrence mappings. Note that, say, 2 ∈ 1, ( ) is allowed to hold even though constant does not occur in 2 : this can only make server 1 send superfluous partial matches to server 2.
The subject constraint consists of two parts. The first part ensures that all facts with the same subject are assigned to one server. In our example, consistency ensures 1 ∈ 1, ( ), and the subject constraint ensures 1, ( ) = {1}, that is, all facts where occurs in the subject position must occur in 1 . Most distributed RDF systems group facts by subject for performance (see Section 3.3), but our approach uses this additionally to send each derived fact to the server containing the fact's subject. For example, when server 2 completes the match from the previous paragraph and derives ⟨ , , ⟩, it knows that the fact should be sent to server 1. The second part of the subject constraint ensures that the subject occurrence mappings do not contain superfluous servers, which is also important for correct placement of derived facts. Note that both servers 1 and 2 can derive ⟨ , , ⟩. Now let us assume that 1, ( ) = {1} (which is allowed by the consistency condition), so the fact is recorded on server 1. In contrast, let us assume that 2, ( ) = ∅ (which also satisfies consistency). Then, server 2 has no information about where to send ⟨ , , ⟩, so it determines the destination by hashing the fact's subject. Now hashing can result in the fact being sent to server 2, which would break the first part of the subject constraint. To remedy this, our approach requires the occurrences for the subject position to be exact.
The correctness condition just combines consistency and the subject constraint, and ensuring that this property is preserved as new facts are derived is a key source of technical difficulty in our approach.
Allowing servers to track the location of relevant constants only introduces one complication: when server receives a partial match from another server, the occurrence mappings stored in server may not cover all constants in the partial match. Potter et al. [45] solve this by accompanying each partial match with a vector of partial occurrences. Whenever a server extends a partial match by matching an atom, it also records in this vector its local occurrences for each newly bound constant that can be used in the rest of the rule body. Occurrences of the matched constants are propagated together with partial matches, which ensures that each server has access to occurrences of constants in atoms that are yet to be matched.

Communication Infrastructure and Messages
We assume that each server in the cluster can send a message to a destination server by calling SEND( , ). This function can return immediately, and the receiver can process the message later-that is, communication can be asynchronous. Also, our core algorithm is correct as long as each sent message is processed eventually-that is, messages sent between a pair of servers can be processed in an arbitrary order without affecting correctness. However, the approach used to detect termination (which is largely orthogonal to our core algorithm) can introduce other message types and might impose constraints on the order of message processing; we discuss this in more detail in Section 4.5.
Message [ , , , ℎ, , ] informs a server that is a partial match obtained by matching some fact with timestamp to the body of a rule with head atom ℎ; moreover, the remaining atoms to be matched are given by an annotated query starting from the atom with index . The partial occurrences of the constants in that may be needed when matching the remaining atoms of are recorded in .
Message [ , , ] says that is a new fact to be stored at server processing the message. Timestamp reflects when the message was sent, and records the partial occurrences of the constants in .
Message [ , , ℎ , , ] says that is a new fact to be stored at server ℎ . Set identifies servers whose occurrences may need updating before can be added to ℎ . Timestamp reflects when the message was sent, and records the partial occurrences of the constants in .
Potter et al. [45] observed that messages correspond to partial join results so a large number of such messages can be produced during query evaluation. For asynchronous processing, the messages may need to be buffered on the receiving server, which can require considerable memory. They also presented a flow control mechanism that can restrict memory consumption at each server without jeopardising completeness. This solution is directly applicable in our approach as well, so we do not discuss it further.

Ensuring Termination
Termination of our approach follows in the same way as in a centralised setting. In particular, the set of facts stored in each server grows monotonically and duplicate facts are removed; since the number of constants in the input dataset is finite, each server can derive at most finitely many facts. Analogously, occurrence mappings grow monotonically as well, so each server can end up containing the maximal occurrence mappings, which are clearly finite as well. Thus, at some point in time, each server will run out of facts to process, at which point it will not generate any further messages. Once all messages queued in the system have been processed, all servers can terminate. However, no server in our approach keeps track of the progress of other servers, so detecting that all servers are idle is not straightforward. We next summarise a well-known solution to this problem.
When messages between each pair of servers are guaranteed to be processed in the order in which they are sent (as is the case in our implementation), one can use Dijkstra's token ring algorithm [11]. All servers in the cluster are numbered consecutively and arranged in a ring (i.e., server 1 comes after the last server). Each server can be black or white, and the servers will pass between them a token that can also be black or white. Initially, all servers are white and server 1 has a white token. The algorithm proceeds as follows.
• When server 1 has the token and it becomes idle (i.e., it has no pending work or messages), it sends a white token to the next server in the ring.
• When a server other than 1 has the token and it becomes idle, the server changes the token's colour to black if the server is itself black (and it leaves the token's colour unchanged otherwise); the server forwards the token to the next server in the ring; and the server changes its colour to white.
• A server turns black whenever it sends a message (i.e., not just a token message) to a lower-numbered server.
• All servers can terminate when server 1 receives a white token.
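Under the assumption of in-order delivery, the token passes described above can be simulated as follows. This is a deliberately simplified sketch in which every server has already become idle, so only the token-forwarding and recolouring rules are exercised; the function name is illustrative:

```python
def token_ring_round(colours):
    """Simulate Dijkstra's token ring over idle servers until server 1
    receives a white token; return the number of token circulations.
    `colours` maps server index (1-based) to 'white' or 'black'."""
    n = len(colours)
    rounds = 0
    while True:
        token = 'white'           # server 1 always emits a white token
        rounds += 1
        for k in range(2, n + 1):
            if colours[k] == 'black':
                token = 'black'   # a black server taints the token
            colours[k] = 'white'  # the server turns white after forwarding
        if token == 'white':
            return rounds         # server 1 sees a white token: terminate
```

A single black server (one that had sent a message backwards in the ring) forces one extra circulation: the first pass blackens the token, and only the second pass returns a white token to server 1.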
The Dijkstra-Scholten algorithm [12] can be used when messages sent between a pair of servers are not guaranteed to be processed in the order in which they are sent. We do not use this algorithm in our implementation, so we do not discuss the details any further.

The Algorithm
With these definitions in mind, Algorithm 1 presents our approach to distributed Datalog materialisation. Before starting, each server receives a copy of the program to be materialised, loads the corresponding partition element into its local store , sets the timestamp of each fact in to zero, initialises its occurrence mappings , , , , and , to be correct for and 1 , … , as per Definition 4.1, and initialises its local clock to zero. The server then starts an arbitrary number of server threads, each executing the SERVERTHREAD function. Each thread repeatedly processes an unprocessed fact in or an unprocessed message ; if both are available, ties are broken arbitrarily. Otherwise, termination is checked as discussed in Section 4.5.
The core of our approach involves matching body atoms of the rules in the program. Rule matching on server can commence in one of the following two ways. First, an unprocessed fact can be extracted from the partition element and passed to the PROCESSFACT function. To start matching the rules to , our algorithm calls MATCHRULES to identify all rules where one pivot atom matches to via a partial answer , and it uses the FINISHMATCH function to extend to a full answer. Second, a message can be received. The message contains a partial answer , and it instructs the server to continue matching a rule body from atom . Thus, the server evaluates atom ⋈ in and w.r.t. to enumerate all partial answers ′ , and for each it uses the FINISHMATCH function to extend ′ to a full answer.
Function FINISHMATCH finishes matching atom by (i) extending with the occurrences of all constants that might be relevant for the remaining body atoms or the rule head, and (ii) either matching the next body atom or deriving the rule head. For the former, when variable is matched to a constant , the occurrences of can be needed to match the rest of the rule only if occurs in the remaining body atoms or in the head atom of the rule being matched. Therefore, the algorithm identifies in line 17 each such variable and adds the occurrences of to for each position . Now if has been matched completely (line 18), the server also ensures that the partial occurrences are correctly defined for the constants occurring in the rule head (lines 19-20); it then identifies the server ℎ that should receive the derived fact as described in Section 4.3, and sends the message to ℎ . Otherwise, atom + must be matched next. To determine the set of servers that could possibly match + , server intersects the occurrences of each constant from + (line 26) and sends a message to all servers in . A message informs server that fact is a newly derived fact that should be added to . The main challenge is to ensure that adding to does not affect the correctness condition from Definition 4.1. To this end, our algorithm identifies in lines 31-36 the set of servers whose occurrences might need updating. In particular, if contains a constant = | at position and server does not occur in the local occurrences , ( ), then all servers containing constant at any position might need updating (line 36); moreover, if occurs in the head of a rule in , then all servers need to be informed of the location of (line 34). Once the set of candidate servers has been constructed, a message is sent to some server in ; since the occurrences of all servers must be updated before fact is added to partition element , the message is sent to server ℎ only if ℎ is the only remaining server in .
Such a message informs server that fact will be added to ℎ , so the occurrences of the constants in might need updating. Set lists the remaining servers that must be informed. A key difficulty arises when a constant becomes relevant to several distinct servers at roughly the same time, so several messages referring to the same constant are circulating simultaneously. This problem is addressed as follows. Upon receiving such a message for a fact containing a constant at position , set ( ) contains the servers that knew about the constant when the fact was derived. Moreover, another intervening message for the same constant will have already updated , ( ); thus, ∶= , ( ) ⧵ ( ) identifies the servers whose occurrences may need additional updating. After computing , server updates its , ( ) by adding ( ); these steps must be performed atomically so that, if two messages concurrently update , ( ), the set computed for the second message contains the servers added by the first message. This set is then added to and ( ) (line 43).
If set is empty at that point, then = ℎ holds due to how servers are extracted from in lines 37 and 46; consequently, all servers have been updated and fact can be added to the local partition element (line 44). Otherwise, the message is forwarded to the remaining servers in ensuring that ℎ is processed last (lines 46 and 47). The following theorem captures the formal properties of our algorithm-that is, the algorithm correctly computes the materialisation and exhibits the nonrepetition property. The proof is given in full in Appendix A.

Example
To clarify the ideas presented in this paper, we next present a simple example that illustrates the flow of processing in our system. To make the example manageable, we consider the rule from Section 4.3 and just two servers with partition elements 1 and 2 , each initially containing just a single fact, as shown below.
The timestamps of both facts are initialised to zero, and the occurrence mappings for the subject and object positions are initialised as follows. To make the example easier to follow, we ignore the occurrence mappings for predicate positions, and we do not consider constants occurring in such positions.
Note that constants and are relevant to server 1, so, to satisfy Definition 4.1, ,1 and ,1 must contain mappings for these two constants. Analogously, ,2 and ,2 must cover constants and since these are relevant to server 2.

Rule Matching and Message Flow
We now illustrate the process of rule matching and the flow of messages in our system. Both servers start matching the rule in their partition element. Server 1 matches atom ⟨ , , ⟩ to fact ⟨ , , ⟩ in 1 , which produces a partial match 1 = { ↦ , ↦ }. The body is not yet fully matched, so server 1 identifies the set of candidate servers that can finish the match (lines 25-26): ,1 ( ) = {2} identifies server 2 as a viable candidate. Note that server 2 has no information about the occurrences of constant , but this information will be needed once server 2 derives ⟨ , , ⟩. Thus, server 1 copies its local occurrences for and into partial occurrences and (lines 16-17), and it then sends the message to server 2 that contains the partial match 1 , timestamp 1 = 0, annotated query 1 = ⟨ , , ⟩ ≤ , and the following partial occurrences: Server 2 eventually processes this message (line 11) and attempts to match the partially instantiated second atom ⟨ , , ⟩ ≤ 1 = ⟨ , , ⟩ ≤ in 2 (line 13). Fact ⟨ , , ⟩ is a match since its timestamp is equal to zero, which is less than or equal to 1 . This produces 2 = 1 ∪ { ↦ }, and server 2 extends the partial occurrences as follows (lines 16-17): The match is now complete (lines 19-20), so server 2 must determine where to store the derived fact ⟨ , , ⟩. Since 2 ( ) = ∅, the server knows that no server in the system contains constant in subject position, so it hashes the subject (line 21). Let us assume that this results in ℎ = 2, that is, the fact should be stored on server 2. Server 2 thus forwards the message to itself containing fact ⟨ , , ⟩, timestamp 1 , and partial occurrences 2 (line 23).
Adding this fact to 2 will change the occurrences as constants and will appear on server 2 in object and subject positions, respectively. To maintain the correctness property of Definition 4.1, server 2 must disseminate the new occurrences to the relevant servers before the fact can be added to 2 . The server updates the partial occurrence to reflect that the fact will be added to server 2 (line 32), thus producing the following partial occurrences: Moreover, server 2 identifies which servers need to be informed of the change by combining the partial and local occurrences of all constants occurring in the derived fact (lines 31-36). This produces = {1, 2}, so server 2 sends a message to server 1 (line 38) containing the derived fact and the updated partial occurrences 3 and 3 .
This message informs server 1 that fact ⟨ , , ⟩ is about to be added to 2 , so server 1 updates its local occurrences for each position and each constant in the fact (lines 41-43). The resulting occurrences on server 1 are as follows: Server 1 then forwards the message further to server 2 (line 47), which eventually adds ⟨ , , ⟩ to 2 (line 44).

Nonrepetition of Derivations
In addition to the steps outlined in Section 4.7.1, server 2 also matches rule in partition element 2 , which produces the partial match 3 = { ↦ , ↦ }. Thus, server 2 will send a message to server 1 containing 3 , timestamp 2 = 0, and annotated query 2 = ⟨ , , ⟩ < . When server 1 processes this message, the partially instantiated atom ⟨ , , ⟩ < matches ⟨ , , ⟩ in terms of its structure, but not in terms of the timestamp. Therefore, the EVALUATE function returns no substitutions, and matching consequently stops at this point. As a result, server 1 does not derive fact ⟨ , , ⟩, that is, this fact is derived only on server 2.

Streaming Partitioning of RDF Data
We now consider how to partition RDF data across the servers of a cluster. To this end, in Section 5.1 we review the drawbacks of existing partitioning schemes and discuss our technical challenges. Then, in Sections 5.2 and 5.3, we present two new techniques that partition RDF data in a streaming fashion, that is, without loading the entire dataset into memory.

Motivation
Approaches to partitioning RDF data are often based on subject hashing. The main benefit of such approaches is their simplicity: just one pass over the dataset is needed, and just one fact must be kept in memory at any point in time. However, subject hashing does not take into account the structure of an RDF dataset, so there is no attempt to ensure locality of subject-object or object-object joins.
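A subject-hashing partitioner of the kind described above can be sketched in a few lines. The function name and the use of `crc32` as the hash are choices of this sketch, not prescribed by any particular system:

```python
import zlib

def subject_hash_partition(triples, num_servers):
    """Assign each (s, p, o) triple to a server by hashing its subject,
    so all facts with the same subject land on the same server. One pass,
    one fact in memory at a time (materialised in lists here only for
    illustration)."""
    partitions = [[] for _ in range(num_servers)]
    for s, p, o in triples:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        server = zlib.crc32(s.encode('utf-8')) % num_servers
        partitions[server].append((s, p, o))
    return partitions
```

Note that the assignment depends only on the subject, which guarantees the subject grouping but, as discussed above, ignores the graph structure entirely.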
A number of partitioning approaches are based on variants of min-cut graph partitioning. Such approaches take the structural properties of an RDF dataset into account and are thus more likely to partition a dataset into tightly connected subsets. However, the time and memory requirements of such methods are often prohibitive. Typically, an entire dataset is loaded into a single server so that one can apply a graph partitioner such as METIS; however, this defeats the main objective of distributing the data, which is to process large datasets using commodity servers. This drawback can, at least in principle, be addressed by using a distributed partitioner such as ParMETIS. Nevertheless, min-cut graph partitioning is NP-hard [9], so, while implementations rarely solve this problem exactly, partitioning often takes a considerable amount of time and memory in practice. Thus, the questions of how to partition RDF data effectively, and how this affects distributed reasoning, are still largely open. To answer the former question, we draw inspiration from the literature on streaming graph partitioning methods [51,43,60,52,36,37,42], where the aim is to produce good partitions while iterating over the graph edges a fixed number of times. The memory usage of these approaches is typically determined by the number of vertices in the graph, which is usually at least an order of magnitude smaller than the number of edges. This results in a much smaller memory footprint for partitioning than with, say, METIS. The HDRF [43] algorithm was recommended as particularly suitable for graphs with a power-law degree distribution [42], which is often present in RDF datasets. Moreover, the 2PS [37] algorithm was shown to sometimes outperform HDRF. Thus, we use HDRF and 2PS as the basis for our work.
Streaming partitioning algorithms, however, were designed for general (directed or undirected) graphs, so applying them straightforwardly to RDF is unlikely to produce good results: facts with the same subject would not necessarily be placed on the same server, which, as we explained in Section 3, is critical. To remedy this, we adapt HDRF and 2PS to take the specifics of RDF into account, but without compromising the quality of the produced partitions.

The HDRF 3 Algorithm
We now present our HDRF 3 algorithm for streaming partitioning of RDF data. We follow the 'high degree replicated first' principle from the HDRF algorithm for general graphs [43]. In Section 5.2.1 we briefly discuss the original idea, and in Section 5.2.2 we adapt these principles to RDF.

The Original HDRF Algorithm
The HDRF algorithm was developed for scale-free undirected graphs, where the distribution of vertex degrees exhibits (or is close to) the power-law distribution. Such graphs contain few high-degree vertices, and many low-degree vertices. HDRF aims to replicate (i.e., assign to more than one server) vertices with higher degrees so that a smaller number of vertices has to be replicated overall. The algorithm processes sequentially the edges of the input graph and assigns them to servers. For each server ∈ {1, … , }, the algorithm maintains the number of edges currently assigned to server ; all of are initialised to zero. Moreover, for each vertex , the algorithm maintains the partial degree ( ) and the partial replication set ( ) in the subgraph processed thus far. For each vertex , the degree ( ) is initialised to zero, and ( ) is initialised to the empty set. To assign an undirected edge { , }, the algorithm first increments ( ) and ( ), and then for each candidate server ∈ {1, … , } it computes the score ( , , ). Finally, the algorithm sends the edge { , } to the server with the highest score ( , , ), increments , and updates sets ( ) and ( ) to contain .
The score SCORE(u, v, k) of sending edge {u, v} to server k consists of two parts, C_REP(u, v, k) and C_BAL(k). The former estimates the impact that placing {u, v} on server k has on replication and is computed as C_REP(u, v, k) = g(u, k) + g(v, k), where g(u, k) = 1 + (1 − θ(u)) if k ∈ R(u) and g(u, k) = 0 otherwise, and θ(u) = δ(u) / (δ(u) + δ(v)). To understand the intuition behind this formula, assume that vertex u occurs only on server k, vertex v occurs only on server k′, and δ(u) > δ(v). We then have C_REP(u, v, k) < C_REP(u, v, k′), which ensures that edge {u, v} is sent to server k′, that is, vertex u is replicated to server k′, in line with our desire to replicate higher-degree vertices. The sum δ(u) + δ(v) in the denominator of the formula for θ normalises the degrees of u and v.
Considering C_REP(u, v, k) only would risk producing partitions of unbalanced sizes. Therefore, the second part of the score is used to favour assigning edge {u, v} to the currently least loaded server using the formula C_BAL(k) = (maxsize − c_k) / (ε + maxsize − minsize), where maxsize and minsize are, respectively, the maximal and minimal current partition sizes, c_k is the number of edges currently assigned to server k, and ε is a small constant that prevents division by zero.
Scores C_REP(u, v, k) and C_BAL(k) are finally combined using a fixed weighting factor λ as SCORE(u, v, k) = C_REP(u, v, k) + λ · C_BAL(k). Thus, λ allows us to control how important balancing partition sizes is compared to achieving low replication factors. The version of the algorithm presented above makes just one pass over the graph edges, and the scores are computed using the partial vertex degrees (i.e., degrees in the subset of the graph processed thus far). The authors of HDRF also discuss a variant where exact degrees are computed in a preprocessing pass, and they show empirically that this does not substantially affect the partition quality.
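A sketch of this scoring scheme, following the published HDRF formulas; the function and variable names are illustrative:

```python
def c_rep(u, v, k, degree, replicas):
    """Replication score: reward servers that already hold u or v, weighting
    each endpoint by the normalised degree of the OTHER endpoint, so the
    higher-degree vertex is the one that ends up replicated."""
    theta_u = degree[u] / (degree[u] + degree[v])
    theta_v = 1.0 - theta_u
    score = 0.0
    if k in replicas[u]:
        score += 1.0 + theta_v   # = 1 + (1 - theta(u))
    if k in replicas[v]:
        score += 1.0 + theta_u   # = 1 + (1 - theta(v))
    return score

def c_bal(k, load, eps=1.0):
    """Balance score: favour the currently least loaded server."""
    maxsize, minsize = max(load.values()), min(load.values())
    return (maxsize - load[k]) / (eps + maxsize - minsize)

def score(u, v, k, degree, replicas, load, lam):
    """Combined HDRF score with weighting factor lam."""
    return c_rep(u, v, k, degree, replicas) + lam * c_bal(k, load)
```

For example, if u has degree 3 and occurs only on server 1, while v has degree 1 and occurs only on server 2, and both servers carry the same load, then server 2 obtains the higher score: the edge is placed there and the high-degree vertex u is the one that gets replicated.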

Adapting the Algorithm to RDF Graphs
Several problems need to be addressed to adapt HDRF to RDF. A minor issue is that RDF facts correspond to labelled directed edges, which we address by ignoring the predicate component of facts. A more important problem is to ensure that all facts with the same subject are placed on the same server. To achieve this, we compute the destination for all facts with a given subject the first time we see a fact with that subject.
The pseudo-code of HDRF 3 is shown in Algorithm 2. It takes as input a parameter determining the maximal acceptable imbalance in partition element sizes, the balance parameter as in HDRF, and another parameter that we describe shortly. The algorithm uses a preprocessing pass over (not shown in the pseudo-code), where it determines the size of the dataset | |, and the out-degree | + ( )| and the degree | ( )| of each constant in . The algorithm also maintains (i) the replication set ( ) for each constant, which is initially empty, (ii) a mapping of constants occurring in subject position to servers, which is initially undefined on all constants, and (iii) the numbers 1 , … , and 1 , … , of facts and constants, respectively, assigned to servers thus far, all of which are initially set to zero.
Our algorithm uses the PROCESSFACT function to assign each fact ⟨s, p, o⟩ ∈ G to a server. Mapping σ keeps track of the server that will receive facts with a particular subject. Thus, if σ(s) is undefined (line 53), the algorithm sets σ(s) to the server with the highest score (line 54) in a way analogous to HDRF. All facts with subject s encountered later will be assigned to server σ(s), so counter f_σ(s) is updated with the out-degree of s (line 55). Finally, the fact is sent to server σ(s) (line 56), and the replication sets of s and o and the number of constants m_σ(s) on server σ(s) are updated if needed (lines 57 and 58).
The score of sending fact ⟨s, p, o⟩ to server k is calculated as in HDRF. The replication part of the score is computed in lines 61 and 62. Unlike the original HDRF algorithm, the first time we encounter a constant s in the subject position, we determine the target server for all facts with subject s in the rest of the input; thus, knowing the degree of s in advance allows us to take into account the impact of all future allocations of facts with subject s to server σ(s). Moreover, we observed empirically that reasoning tends to be faster when partition elements have roughly similar average constant degrees. Function DEG estimates the current average degree of constants in server k as the quotient of the current numbers of facts f_k and constants m_k assigned to server k. Then, in lines 61 and 62, the score is updated only if the average degree of server k is close (i.e., within the range defined by the parameter μ) to the minimal average degree.
Line 63 computes the balance factor C_BAL(k) by observing that a partition element can have at most α|G|/n facts.
Finally, C_REP and C_BAL are combined using λ in line 64. Unlike in the original HDRF algorithm, the factor Σ_k f_k / |G| ensures that partition balance grows in importance towards the end of partitioning.
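Putting the fact-assignment step together, a minimal Python sketch might look as follows. The names (sigma, alpha, lam, mu) follow our reading of the description above, and the exact degree-gating and progress-weighting details are our assumptions, not the authors' code.

```python
# Sketch of the HDRF3 fact-assignment step: all facts sharing a subject are
# pinned to one server, chosen once via an HDRF-style score. The predicate
# component of each fact is ignored, as described in the text.

class HDRF3:
    def __init__(self, n, total_facts, out_deg, alpha=1.25, lam=1.0, mu=0.25):
        self.n, self.total = n, total_facts
        self.out_deg = out_deg          # precomputed in a preprocessing pass
        self.sigma = {}                 # subject -> server
        self.facts = [0] * n            # facts assigned per server
        self.consts = [0] * n           # constants per server
        self.replicas = {}              # constant -> set of servers
        self.alpha, self.lam, self.mu = alpha, lam, mu

    def _avg_deg(self, k):
        return self.facts[k] / self.consts[k] if self.consts[k] else 0.0

    def _score(self, s, o, k):
        rep = 0.0
        # Credit replicas only on servers whose average constant degree is
        # within mu of the minimum, to keep average degrees similar.
        min_avg = min(self._avg_deg(j) for j in range(self.n))
        if self.consts[k] == 0 or self._avg_deg(k) <= min_avg * (1 + self.mu):
            for c in (s, o):
                if k in self.replicas.get(c, set()):
                    rep += 1.0
        cap = self.alpha * self.total / self.n   # max facts per server
        bal = self.lam * (cap - self.facts[k]) / cap
        done = sum(self.facts) / self.total      # balance matters more later
        return rep + done * bal

    def process_fact(self, s, p, o):             # p is ignored
        if s not in self.sigma:
            self.sigma[s] = max(range(self.n),
                                key=lambda k: self._score(s, o, k))
            self.facts[self.sigma[s]] += self.out_deg[s]  # future facts too
        k = self.sigma[s]
        for c in (s, o):
            if k not in self.replicas.setdefault(c, set()):
                self.replicas[c].add(k)
                self.consts[k] += 1
        return k
```

Updating the fact counter with the full out-degree of a subject the moment the subject is first routed mirrors the design choice discussed above: the cost of all future facts with that subject is charged up front.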
As we mentioned in Section 2, balancing partition sizes while minimising the replication factor is computationally hard, so the minimality requirement is typically dropped. The following result shows that Algorithm 2 honours the balance requirements, provided that α and λ are chosen in a particular way. The proof is given in Appendix B.
Proposition 5.1. Algorithm 2 produces a partition that satisfies |G_k| ≤ α|G|/n for each 1 ≤ k ≤ n whenever α and λ are selected to satisfy the condition stated with the proof in Appendix B.

The 2PS3 Algorithm
We now present our 2PS3 algorithm for RDF, which adapts the two-phase streaming algorithm 2PS [37]. In Section 5.3.1 we discuss the original algorithm, and in Section 5.3.2 we discuss how we apply its principles to RDF.

The Original 2PS Algorithm
The 2PS algorithm processes undirected graphs in two phases. In the first phase, the algorithm clusters vertices into communities, aiming to place highly connected vertices into a single community. This is achieved by initially assigning each vertex to its own community; then, as the edges are streamed, the vertex of an edge belonging to the smaller community is moved into the community of the other vertex, so highly connected vertices gradually coalesce. After all edges are processed in the first phase, the identified communities are greedily assigned to servers. Then, the graph is processed in the second phase, and edges are assigned to the communities of their vertices.
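The first phase can be sketched as follows; this is an illustrative reimplementation of the community-merging idea, not the authors' code.

```python
# Sketch of the 2PS first phase: every vertex starts in its own community;
# for each streamed edge, the vertex in the smaller community migrates to
# the larger one, so densely connected vertices end up together.

def cluster(edges, vertices):
    community = {v: v for v in vertices}    # vertex -> community id
    size = {v: 1 for v in vertices}         # community id -> member count
    for u, v in edges:
        small, big = sorted((u, v), key=lambda x: size[community[x]])
        if community[small] != community[big]:
            size[community[small]] -= 1
            size[community[big]] += 1
            community[small] = community[big]
    return community
```

Streaming the edges once already separates loosely connected regions; as noted below for our RDF variant, additional passes can further improve the community structure.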

The Algorithm
Just like in the case of HDRF, the main challenge in extending 2PS to RDF is to deal with the directed nature of RDF facts, and to ensure that facts with the same subject are assigned to the same server.
The pseudo-code of 2PS3 is shown in Algorithm 3. As in HDRF3, the algorithm uses a preprocessing phase to determine the size of the dataset |G| and the out-degree |d+(c)| of each constant c. Thus, our algorithm actually uses three phases; however, to stress the relationship with the 2PS algorithm, we call it 2PS3.
The algorithm maintains a global mapping γ of constants to communities; that is, γ(c) is the community of each constant c. Thus, two constants c1 and c2 are in the same community if γ(c1) = γ(c2). Initially, each constant c is assigned to its own community. As the algorithm progresses, the image of γ contains fewer and fewer communities. Once communities are assigned to servers, a fact ⟨s, p, o⟩ is assigned to the server of community γ(s); thus, facts with the same subject are assigned to one server.
The algorithm also maintains a global function size that, for each community C, keeps track of the number size(C) of facts whose subject is assigned to community C. Thus, size(C) is initialised as |d+(c)| for the singleton community C of each constant c.
After initialisation, each fact ⟨s, p, o⟩ ∈ G is processed using function PROCESSFACT-PHASE-I. In line 68, the algorithm compares the sizes size(γ(s)) and size(γ(o)) of the communities to which s and o, respectively, are currently assigned. It identifies cmax as the constant whose current community size is larger, and cmin as the constant whose current community size is smaller (ties are broken arbitrarily). The aim of this is to move cmin into the community of cmax, but this is done only if, after the move, we can satisfy the requirement on the sizes of partition elements: if each community contains no more than (α − 1)|G|/n facts, we can later assign communities to servers greedily and the resulting partition elements will contain fewer than α|G|/n facts. This is reflected in the condition in line 68: if it is satisfied, the algorithm updates the sizes of the communities of cmin and cmax (lines 70-71), and it moves cmin into the community of cmax (line 72). If desired, G can be processed several times using function PROCESSFACT-PHASE-I to improve the community structure.
After G is processed, function ASSIGNCOMMUNITIES assigns communities to servers. To this end, for each server k, the algorithm maintains the number of facts currently assigned to partition element G_k. Then, the communities from the image of γ (i.e., the communities that have 'survived' the shuffling of constants in the first phase) are assigned by greedily preferring the least loaded server. Finally, using function PROCESSFACT-PHASE-II, each fact ⟨s, p, o⟩ ∈ G is assigned to the server of community γ(s). As in HDRF3, our algorithm is not guaranteed to minimise the replication factor. However, the following result shows that the algorithm honours the restriction on the sizes of partition elements for a suitable choice of α. The proof is given in Appendix C.
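The last two steps can be sketched as follows; the function names are ours, and the greedy rule shown (largest communities first, least loaded server) is one reasonable reading of the description above rather than the authors' exact strategy.

```python
# Sketch of the final 2PS3 steps: surviving communities are assigned
# greedily to the least-loaded server, and each fact then follows the
# community of its subject.

def assign_communities(com_sizes, n):
    """Greedy bin packing: biggest communities first, least-loaded server."""
    load = [0] * n
    server_of = {}
    for c, sz in sorted(com_sizes.items(), key=lambda kv: -kv[1]):
        k = min(range(n), key=load.__getitem__)
        server_of[c] = k
        load[k] += sz
    return server_of, load

def place_fact(fact, community, server_of):
    """Phase II: a fact goes to the server of its subject's community."""
    s, p, o = fact
    return server_of[community[s]]
```

Because the load added in each greedy step is bounded by the maximal community size, capping communities at (α − 1)|G|/n in the first phase keeps every partition element below α|G|/n.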

Evaluation
To empirically evaluate the techniques we presented in this paper, we implemented a prototype distributed Datalog reasoner called DMAT. To store and manage triples in RAM, we have reused the storage subsystem of RDFox [40], a state-of-the-art centralised RDF system. The storage subsystem of RDFox maintains exhaustive hash-based indexes as described by Motik et al. [40] to support efficient enumeration of triples matching a single atom; for example, given an atom with a variable in one position and constants in the others, it can efficiently provide all values for the variable that instantiate the atom to a triple in the locally stored dataset. On top of this basic data storage facility, we implemented a mechanism for associating triples with timestamps, and we implemented the EVALUATE function by first identifying candidate matches using the functionality provided by RDFox and then filtering the matches according to timestamps. Finally, we implemented our algorithms from scratch (i.e., without reusing any algorithms from RDFox). Our prototype is written in C++. At the moment, DMAT can run our materialisation algorithm on just one thread: the need to synchronise threads on one server introduced considerable complexity to our implementation, so we decided to leave this aspect for future work. DMAT can partition the data using subject hashing, a variant of min-cut partitioning by Potter et al. [45] that we call METIS, and the HDRF3 and 2PS3 algorithms described in Sections 5.2 and 5.3, respectively.
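The single-atom lookup functionality we rely on can be illustrated with a toy in-memory index; RDFox's actual data structures are considerably more elaborate, and the class below is only a didactic sketch with names of our choosing.

```python
# Toy hash-based triple indexing: per-pattern indexes let us enumerate all
# triples matching an atom in which some positions are fixed (None = variable).
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.by_sp = defaultdict(list)   # (subject, predicate) -> objects
        self.by_po = defaultdict(list)   # (predicate, object) -> subjects
        self.all = []

    def add(self, s, p, o):
        self.by_sp[(s, p)].append(o)
        self.by_po[(p, o)].append(s)
        self.all.append((s, p, o))

    def match(self, s, p, o):
        """Enumerate triples matching the atom without a full scan when possible."""
        if s is not None and p is not None:
            return [(s, p, x) for x in self.by_sp[(s, p)] if o in (None, x)]
        if p is not None and o is not None:
            return [(x, p, o) for x in self.by_po[(p, o)]]
        return [t for t in self.all
                if all(q in (None, v) for q, v in zip((s, p, o), t))]
```

A real system would keep indexes for every binding pattern; two suffice here to show how an atom with one variable is answered by a single hash lookup.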
We evaluated DMAT by conducting four sets of experiments. First, we investigated how using different partitioning schemes, including HDRF 3 and 2PS 3 , affects the performance of materialisation. Second, we investigated the extent to which the performance of our algorithm depends on network speed. Third, we studied how the performance of materialisation changes when the input data and the number of servers increase. Fourth, we compared DMAT with BigDatalog and Cog, two state-of-the-art systems that use static data exchange strategies.
We next present our experimental setting and discuss the results. In particular, we introduce our datasets in Section 6.1, we discuss our test setting in Section 6.2, and then we present the data partitioning tests in Section 6.3, the network speed tests in Section 6.4, the scalability tests in Section 6.5, and the system comparison tests in Section 6.6. Considering data partitioning first will allow us to identify 2PS 3 as the partitioning strategy that seems to offer the best performance on average, so we use 2PS 3 in the remaining two tests. The DMAT system and all datasets and programs used in our experiments are available online. 3

Datasets
In this section we discuss the datasets we used in our experiments. Table 2 summarises some basic information about the programs used. The sizes of the input datasets varied in each test, so we report the data sizes when discussing the relevant experiments. All programs and datasets are available from the mentioned Web page. The LUBM [20] benchmark has been extensively used to test the performance of RDF systems. We generated datasets of varying sizes using LUBM's data generator, and we used the extended lower bound Datalog program by Motik et al. [40] designed to stress-test reasoning systems. The program was obtained by translating the OWL 2 RL portion of the LUBM ontology into Datalog and manually adding several recursive rules that produce many redundant derivations. To the best of our knowledge, this program has not yet been used in the literature to test the performance of distributed RDF reasoners, and it provides us with more insights than the standard, relatively 'easy' lower bound program.
The WatDiv [6] benchmark was developed for testing the performance of query answering in RDF systems. It comes with a data generator that can produce datasets in which the degrees of resources follow a power law distribution. Such datasets are challenging to both query answering and partitioning algorithms, which makes WatDiv highly relevant to our setting. However, WatDiv does not include an ontology or a Datalog program, so we manually created a program consisting of 32 chain, cyclical, and recursive rules.
The Microsoft Academic Knowledge Graph [13] is an RDF translation of the Microsoft Academic Graph, a heterogeneous dataset of scientific publication records, citations, authors, institutions, journals, conferences, and fields of study. The original MAKG dataset contains 8 billion triples and includes links to datasets in the Linked Open Data Cloud. A significant portion of the dataset consists of triples with datatype properties providing annotations such as names, publication dates, various counts, and so on. Such triples are not interesting for reasoning as they do not define the graph structure, but they increase the memory requirements on servers and data loading times. Thus, to make experimentation practical, we selected 3.67 billion triples with 'structural' properties; the list of all selected properties is available on the aforementioned Web page. We call the resulting dataset MAKG*. The dataset does not come with an ontology or a Datalog program, so we manually created a program consisting of 15 chain, cyclical, and recursive rules.
As far as we know, this is the first time that WatDiv and MAKG were used to benchmark Datalog reasoning.

Test Setting
We ran our experiments on the Amazon EC2 cloud. For each server in the cluster, we used one r5.4xlarge instance equipped with a 3.1 GHz Intel Xeon Platinum 8000 series (Skylake-SP or Cascade Lake) processor, two virtual CPUs, 128 GB of RAM, and running Linux kernel 4.14. The servers were connected by Ethernet that, according to Amazon, can support speeds up to 10 Gbps. For the experiments with DMAT, the disk space was irrelevant since our system stores data in RAM. For the experiments with BigDatalog and Cog, we equipped each server with 1 TB of Amazon Elastic Block Storage (EBS) to be able to run Spark and Flink. Finally, METIS requires loading the entire dataset into memory, so we used one r5.24xlarge server with 784 GB of RAM to partition the data in the experiments with METIS.
Two virtual CPUs per server were sufficient for our experiments: as we already mentioned, DMAT can currently run our algorithm only on one thread; thus, to ensure a fair comparison, we configured BigDatalog and Cog to use just one worker thread per server. In our data partitioning and system comparison experiments, we computed the materialisation using DMAT, BigDatalog, and Cog on ten servers. In our scalability experiments, we scaled the number of servers proportionally to the input size. In all experiments with DMAT, we used one additional 'master' server whose role was to distribute the facts and the program to the servers and initiate materialisation. We used the 'master' server mainly for convenience: this server never participated in materialisation, so its use in the initialisation phase does not affect our results in any substantial way.

Data Partitioning Experiments
The objective of our partitioning experiments was to see how different data partitioning strategies affect the performance of materialisation. To this end, we compared the HDRF3 and 2PS3 algorithms from Section 5 with subject hashing and the METIS variant of min-cut partitioning by Potter et al. [45]. The latter approach uses weighted graph partitioning to balance the number of triples, rather than the number of resources, on different servers.
As we mentioned in Section 3.3, several approaches have been proposed in the literature that replicate facts to more than one server [27,21,33,5]. However, the main objective of our work was to avoid repetition of derivations in a distributed setting, which seems incompatible with data replication. As a result, our reasoning algorithm requires all partition elements to be pairwise disjoint, and so we cannot include any data partitioning strategy that involves data replication in our evaluation.
Test Setting. We used the LUBM dataset for 10K universities, the WatDiv-1G dataset provided by the authors of WatDiv, and the MAKG* dataset in this test; for each dataset, Table 3 shows the number of resources, the numbers of input and output triples, and the number of derivations (i.e., the number of facts derived in line 23 before duplicate elimination). To speed up loading, we preprocessed all datasets by replacing all resources with integers. With Hash, HDRF3, or 2PS3, the 'master' server processed the data in a streaming fashion and distributed the triples to the ten materialisation servers, and then it started the materialisation by distributing the Datalog program. With METIS, the precomputed partition elements were loaded directly into the materialisation servers, and the 'master' just distributed the Datalog program. To hash the triples' subjects, we simply multiplied the integer subject value by a large prime in order to randomise the distribution. In HDRF3 and 2PS3, we used α = 1.25. Also, in HDRF3, we used μ = 0.25, and λ was set to the lowest value satisfying Proposition 5.1; the values of λ thus vary for each dataset and are shown in Table 3. We processed the datasets twice in the first phase of 2PS3. We recorded the wall-clock time and the number of messages sent on each server during materialisation. For each test, Table 3 shows the minimum, maximum, and median numbers of triples in partition elements (given as percentages of the number of input triples), the replication factor (see Section 2 for a definition), the partitioning and reasoning times, and the number of nonlocal messages (i.e., the number of messages that were sent over the network).
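The subject-hashing scheme amounts to a one-line multiplicative hash. The sketch below is illustrative: the text does not state which prime was used, so we pick Knuth's well-known multiplicative-hashing constant, and the bit shift before the modulus is our choice to spread consecutive ids.

```python
# Sketch of the subject-hash partitioner: integer-encoded subjects are
# multiplied by a large prime to randomise the distribution of consecutive
# ids before picking a server. The prime and shift here are illustrative.

PRIME = 2_654_435_761    # Knuth's multiplicative-hashing constant (a prime)

def server_for(subject_id, n_servers):
    return ((subject_id * PRIME) >> 16) % n_servers
```

Because all facts with the same subject hash identically, this scheme trivially satisfies the subject-locality requirement, at the cost of ignoring graph structure entirely.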
Partition Times and Balance. All partitioning schemes produced partition elements with sizes within the tolerance parameters: Hash achieves perfect balance if the hash function is uniform; METIS explicitly aims to equalise partition sizes; and our algorithms do so by design and the choice of parameters. For all streaming methods, the partitioning times were not much higher than the time required to read the datasets from disk and send triples to their designated servers. In contrast, METIS partitioning took longer than materialisation on LUBM-10K and WatDiv-1G, and on MAKG * it ran out of memory even though we used a very large server equipped with 784 GB of RAM.

Replication, Communication, and Reasoning.
The lowest replication factors were generally achieved with 2PS3: only METIS achieved a lower value on WatDiv-1G, and HDRF3 achieved a comparable value on MAKG*. The replication factor of Hash was the highest in all cases, closely followed by HDRF3. Moreover, lower replication factors seem to correlate closely with decreased communication overhead; for example, the number of messages was significantly smaller on LUBM-10K and MAKG* with 2PS3 than with the other schemes. This reduction seems to generally lead to shorter reasoning times: 2PS3 was faster than all other schemes on LUBM-10K and MAKG*; for the former, the improvement over Hash is by a factor of 2.25. However, the reasoning times do not always correlate with the replication factor: on WatDiv-1G, METIS and 2PS3 were slower than Hash and HDRF3, despite exhibiting smaller replication factors.
Workload Imbalance. To further analyse the results of our experiments, we show in Figure 1 the numbers of derivations and the total size of partial messages processed by each of the ten servers in the cluster. As one can see, the numbers of derivations and messages per server are quite uniform for Hash and, to an extent, for HDRF 3 ; in contrast, with 2PS 3 and METIS, certain servers seem to be doing much more work than others, particularly on WatDiv-1G and MAKG * . Thus, reducing communication seems to matter only up to a point. For example, 2PS 3 reduces communication by about an order of magnitude on LUBM-10K, which, combined with a uniform workload distribution, seems to 'pay off' in terms of reasoning times. On MAKG * , 2PS 3 reduces communication by about a factor of two, while HDRF 3 seems to distribute the workload more evenly. Combined, these factors lead to more modest (yet still significant) improvements in reasoning times for 2PS 3 . Finally, on WatDiv-1G, communication overhead does not appear to be significant with any partitioning strategy, so the reasoning times seem to be determined mainly by the workload imbalance.
Graph Structure. LUBM-10K data is organised into universities, where most triples connect entities within one university; thus, the data seems to naturally decompose into modules of roughly the same sizes. The 2PS 3 algorithm seems to exploit this modular structure very well, allowing it to reduce the communication overhead by an order of magnitude. In contrast, WatDiv-1G and MAKG * do not seem to decompose into modules as easily. In fact, WatDiv-1G was specifically designed to produce RDF graphs where the vertex degrees follow a power-law distribution; such datasets have irregular structure and highly variable constant degrees, which makes partitioning difficult. The original HDRF algorithm was identified as particularly suitable for such graphs [42]. Our results seem to agree: HDRF 3 offers the best materialisation times on WatDiv-1G.
Overall Performance. In general, 2PS 3 seems to provide a good performance mix: unlike METIS, it can be implemented without placing unrealistic requirements on the servers used for partitioning; it can significantly reduce communication, particularly on highly modular graphs; and the resulting partition can lead to workload imbalances, but these do not appear to be excessive. Thus, 2PS 3 seems like a good alternative to the thus far dominant hash partitioning, and therefore we use it in our scalability (Section 6.5) and system comparison experiments (Section 6.6). The HDRF 3 algorithm seems to be worth considering on graphs with a power-law vertex degree distribution.

Effects of Network Speed
Although 10 Gbps network speeds are often found in modern data centres, it is nevertheless interesting to see how our techniques perform on older and slower networks. Since our main objective is to reduce network communication, one can expect performance gains from locality-aware processing to be more pronounced with slower networks. To answer this question, we artificially slowed down the network of our servers using the tc Linux command and then materialised the WatDiv-1G dataset partitioned using the 2PS3 algorithm. We chose WatDiv-1G because, as we discuss in Section 6.3, the percentage of local messages during materialisation is lowest on this dataset, so a slower network is more likely to affect reasoning times. Table 4 shows the materialisation times for four network speeds.
As one can see, materialisation times grow as the network speed decreases, but the increase is sublinear: if we slow down the network by a factor of 19.5, the materialisation time increases only by a factor of two. This suggests that network communication is an important, but not the only, factor determining the performance of our algorithm. Moreover, materialisation times increase only by 35% on the still relatively common 1 Gbps networks, showing that our techniques are applicable to commodity hardware.

Scalability Experiments
The main objective of data distribution is scalability, that is, the ability to process increasing data loads without a significant increase in processing times by proportionally extending the cluster. Note, however, that the size of the input data is not always representative of the work needed to compute the materialisation. For example, applying the rule ⟨x, R, y⟩ ∧ ⟨y, R, z⟩ → ⟨x, R, z⟩ to a dataset consisting of triples ⟨a1, R, a2⟩, ⟨a2, R, a3⟩, …, ⟨an, R, a1⟩ derives n² triples, and it requires matching the rule body in n³ ways; thus, materialisation time is likely to depend cubically on the input size in this example. We therefore analyse the scalability of DMAT in terms of two natural and complementary ways to measure the amount of work.
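The n² and n³ figures for the cycle example can be checked with a few lines of Python; the naive fixpoint below is for illustration only and ignores the seminaïve optimisations discussed earlier.

```python
# Checking the cubic-work claim on an n-cycle: the transitivity rule
# <x,R,y> & <y,R,z> -> <x,R,z> materialises n*n triples, and its body
# matches the final materialisation in n**3 ways.

def transitive_closure(edges):
    facts = set(edges)
    changed = True
    while changed:                      # naive fixpoint iteration
        changed = False
        for (x, y) in list(facts):
            for (y2, z) in list(facts):
                if y == y2 and (x, z) not in facts:
                    facts.add((x, z))
                    changed = True
    return facts
```

On a cycle, every resource reaches every resource (including itself), so the closure contains all n² pairs, and each of them joins with n continuations.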
Work Measures. The number of derivations is a good measure of the amount of work for the following reasons. First, this number is equal to the number of answers obtained by evaluating the bodies of all rules as queries over the materialisation, which is fixed for every dataset-that is, the number of derivations does not depend on the materialisation algorithm. Second, duplicate facts can be eliminated in constant amortised time, so the number of derivations also estimates the amount of work for duplicate elimination. Hence, this is a natural measure for seminaïve evaluation, where each derivation is made exactly once.
In addition, we shall also consider the number of messages produced during materialisation. If most partial answers lead to a derivation (and our experience suggests that this is frequently the case), the number of messages is much smaller than the number of derivations. However, this is not necessarily so; for example, in a chain rule, the join of the first two body atoms can produce many partial answers that do not 'survive' the join with the third atom; thus, computing the partial answers in the join of the first two atoms can dominate the performance of reasoning and should be taken into account. The main drawback of measuring the work in terms of the number of messages is that this number is determined not only by the dataset and the rules, but also by the order of atoms in rule bodies.
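The chain-rule effect is easy to demonstrate on a toy example; the relation names below are ours, and the hypothetical rule A(x,y) ∧ B(y,z) ∧ C(z,w) → D(x,w) stands in for the chain rules discussed above.

```python
# Why message counts can dwarf derivation counts: the join of the first
# two atoms of a chain rule may produce many partial answers that the
# third atom then filters out.

def chain_join(a, b, c):
    partial = [(x, z) for (x, y) in a for (y2, z) in b if y == y2]
    answers = [(x, w) for (x, z) in partial for (z2, w) in c if z == z2]
    return partial, answers
```

With an unselective middle relation and a selective last one, almost all of the intermediate work yields no derivation, which is exactly the situation where the message count is the more informative work measure.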
Test Setting. LUBM and WatDiv are ideally suited to this experiment as we can scale the input datasets in a controlled manner. Thus, we generated LUBM datasets for 2K, 4K, 8K, and 10K universities, and WatDiv datasets with roughly 200 M, 400 M, 800 M, and 1 G triples. In contrast, MAKG* is a real-world dataset that cannot be scaled easily, so we did not consider it in this experiment. We used the test setting described in Section 6.3, but we scaled the number of servers proportionally to the input size. We conducted all experiments using the 2PS3 partitioning strategy.
Results. The results of our scalability experiments are summarised in Table 5. For each dataset, we report the cluster size, number of resources in the input, the numbers of triples in the input and output, the materialisation time, the number of derivations, the derivation rate (i.e., the number of derivations per server per second), the number of messages, the percentage of messages that are local to a server (i.e., that are not sent over the network), and the total reasoning rate (i.e., sum of the numbers of derivations and messages processed per server per second).

Discussion.
In the two benchmarks, the amount of work scales differently with the input size. On LUBM, the number of messages is an order of magnitude smaller than the number of derivations; hence, the benchmark is 'well-behaved' in the sense that most partial answers contribute to a derivation. Also, the overwhelming majority of messages are local. This is unsurprising because each university in a LUBM dataset contains roughly the same number of triples, and there are relatively few connections between universities. Thus, the number of derivations scales roughly linearly with the input size, which allows DMAT to exhibit near-constant derivation and reasoning rates.
In contrast, the number of messages on WatDiv is much larger than the number of derivations, it scales superlinearly with the input size, and the percentage of local messages decreases steeply as the input grows. We conjecture that this is because WatDiv datasets exhibit a highly irregular structure, so the difficulty of partitioning increases with the dataset size. As a result, the derivation rate drops significantly as the dataset increases in size: it is about 3.6 times lower on the 1 G dataset than on the 200 M dataset. The reasoning rate also drops, but by a smaller factor of 2.1. Nevertheless, note that the overall amount of data increases by a factor of five, so the decrease in performance is still below the increase in overall data size.
To summarise, our reasoning approach seems to scale well when the overall work scales linearly with input size, and increasing the input size does not create a highly connected dataset that is difficult to partition. However, even in the latter case, our approach is still able to materialise large datasets with complex, recursive rule sets.

System Comparison Experiments
To see how DMAT compares to the state of the art, we compared it to BigDatalog [50] and Cog [28], which are based on Spark and Flink, respectively. We are grateful to the authors of both systems for their extensive assistance.
Test Systems. We obtained the source code of BigDatalog and Cog from public repositories and compiled it ourselves. Both systems rely on Apache Calcite,4 an open source framework for building databases and data management systems, to compile logical plans into SQL and recursion operators. This framework, however, does not seem to correctly handle arbitrary recursive Datalog programs, and it also could not process larger programs. After extensive experimentation and discussion with the systems' authors, we were able to compile only small linear programs. To overcome this setback, in this experiment we selected from each Datalog program two rules, one of which was recursive.

As we mentioned in Section 3.1, BigDatalog and Cog are not classical materialisation systems; rather, they take as input a query and then materialise only the part of the program relevant to the query. Thus, when running BigDatalog and Cog, we used ⟨x, R, y⟩ as the query, where R is the property computed by the two rules. Moreover, BigDatalog and Cog are based on the standard relational data model, rather than the RDF model. We used the well-known vertical partitioning technique [1,8] to transform triples into relations: we introduced a binary relation R_p for each property p occurring in the input dataset or a rule; we converted each input triple ⟨s, p, o⟩ to tuple ⟨s, o⟩ in relation R_p; and we transformed each rule or query atom ⟨t1, p, t2⟩ into a relational atom R_p(t1, t2), which was possible because p is never a variable. We also eliminated from the input datasets all triples of the form ⟨s, p, o⟩ where p does not occur in the body of the two rules; such triples cannot be matched by the rules, so they do not contribute to the materialisation, and this data reduction made our experiments more practical. Table 6 shows the numbers of input and output triples in the reduced datasets.
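The vertical-partitioning transformation, including the elimination of unmatched properties, can be sketched in a few lines; the function name is ours.

```python
# Sketch of vertical partitioning: one binary relation per property, with
# triples whose property cannot be matched by the rules dropped up front.

def vertical_partition(triples, body_properties):
    relations = {}
    for s, p, o in triples:
        if p in body_properties:        # keep only properties the rules use
            relations.setdefault(p, []).append((s, o))
    return relations
```

Each relation R_p then directly answers atoms of the form ⟨t1, p, t2⟩, which is why the transformation requires the predicate position to be a constant.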
Test Setting. We conducted our experiments as follows. For BigDatalog and Cog, we copied the input datasets into the local directories of all ten materialisation servers, we instructed the systems to answer the query (which involved materialising the rules), and we measured the wall-clock time required. For DMAT, we used the test setting described in Section 6.3, but modified to use the two-rule program. In all tests, we measured the total amount of data sent over the network using the ifconfig command. Table 6 shows, for each test, the materialisation time and the total amount of data sent over the network.

Results.
For DMAT, the table also shows the number of derivations, the number of messages, and the percentage of messages that are local; note that such metrics do not apply to BigDatalog and Cog. On MAKG * , BigDatalog ran out of memory, and Cog could not compile the program.

Discussion.
As one can see from the table, DMAT consistently outperformed both systems. The difference is not as pronounced on WatDiv, but it is by a factor of three or more on LUBM. Moreover, even the reduced MAKG* program with just two rules was too complex for the other systems: BigDatalog ran out of memory despite each server being equipped with 1 TB of storage that Spark could use for scratch data. While it is hard to be absolutely sure about the cause, we conjecture that this is because Spark evaluates queries bottom-up and materialises the intermediate results, which can be costly. In contrast, DMAT uses nested index loop joins, and it processes local messages without storing them: only partial answers that need to be sent to another server are 'materialised' in the sense that they are explicitly created and stored (e.g., in buffers of the networking stack). As a result, our rule evaluation strategy can be less memory-intensive when the rules are complex.
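The contrast between the two strategies can be illustrated with a generator-based nested index loop join over chain rules; this is a simplified didactic sketch, not DMAT's implementation, and it handles only chains of binary joins on a (predicate, subject) index.

```python
# Streaming partial answers through generators: local matches flow through
# the pipeline one at a time and are never buffered, unlike a bottom-up
# plan that materialises every intermediate join result.
from collections import defaultdict

def index(facts):
    idx = defaultdict(list)
    for s, p, o in facts:
        idx[(p, s)].append(o)          # lookup by (predicate, subject)
    return idx

def evaluate_chain(idx, props):
    """Lazily enumerate paths x -p1-> y -p2-> ... following the given properties."""
    def extend(prefix):
        if len(prefix) == len(props) + 1:
            yield tuple(prefix)
            return
        p = props[len(prefix) - 1]
        for o in idx[(p, prefix[-1])]: # index lookup, no intermediate table
            yield from extend(prefix + [o])
    starts = {s for (p, s) in idx if p == props[0]}
    for s in sorted(starts):
        yield from extend([s])
```

Only bindings that survive every join are ever fully constructed, which mirrors the observation that just the partial answers crossing to another server need to be explicitly stored.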
Our technique seems to be most effective when the data can be partitioned well. As we observed in Section 6.3, the 2PS3 algorithm seems to detect the modular structure of LUBM, which considerably reduces communication. Indeed, most messages are local on LUBM, so materialisation introduces very little communication. On WatDiv, 2PS3 is not as effective in reducing communication: only 12% of messages are local, which makes reasoning about three times slower than on LUBM, and leads to an order of magnitude more communication than with the other two systems. On MAKG*, our techniques also seem to be very effective at reducing communication. In contrast, data partitioning in BigDatalog and Cog is not locality-aware, which incurs more communication and longer reasoning times; moreover, the difference in their performance on LUBM and WatDiv is smaller because neither dataset is partitioned to exploit locality.
Note: TX is the total amount of data in GB transmitted over the network in the cluster.

Conclusion and Further Work
In this paper we have presented a novel approach to Datalog reasoning in distributed RDF systems. Our materialisation algorithm supports arbitrary Datalog rules over RDF data and arbitrary partitions of the input dataset. To the best of our knowledge, no other system has these traits. We have also presented two new streaming partitioning algorithms that enable our reasoner to process large datasets. We have shown experimentally that our techniques can considerably reduce communication, scale well in many cases, and are competitive with regard to the state of the art.
In our future work, we plan to extend our implementation with support for multi-threaded processing in the servers to improve parallelism. We will also aim to further improve materialisation performance by reducing imbalances in the workload among servers. One possibility might be to analyse the Datalog program before partitioning and thus identify workload hotspots. Furthermore, we aim to support more advanced features of Datalog, such as stratified negation and aggregation, which are needed in many practical applications. Another question that, to the best of our knowledge, has not been considered in the literature thus far is how to efficiently support distributed incremental reasoning.

A. Proofs for Section 4
Theorem 4.1. Let I be a dataset, let P be a Datalog program, and let I_1, …, I_ℓ be the datasets obtained by applying Algorithm 1 to an arbitrary partition of I as specified in this section. Then, P^∞(I) = I_1 ∪ ⋯ ∪ I_ℓ, and moreover the algorithm exhibits the nonrepetition property.
Throughout this section, we fix an arbitrary Datalog program P, input dataset I, partition of I, and a run of Algorithm 1 as specified in Section 4. We assume that the result of the algorithm's execution is equivalent to a run where all lines on all servers are executed in some sequential order; that is, we assume that our computation is sequentially consistent. Thus, each line of the algorithm is executed at some time instant t where t ≥ 0, and time instant zero refers to the algorithm's start. This allows us to talk about the state of some data structure at time instant t, which is the state of the data structure just after the line at time instant t was executed. We next introduce several useful definitions.
We say that a fact F occurs at time instant t on server k if server k contains F at that time instant. A constant c occurs at time instant t on server k at position p ∈ Π if there exists a fact F that occurs at time instant t on server k such that F|_p = c. A constant c is relevant at time instant t to server k if c occurs in the head of a rule in P or in server k at any position at time instant t. We use 'occur initially' and 'occur eventually' to refer to time instant zero and the time instant at the algorithm's termination, respectively. Fact F is derived on server k if F does not occur initially on server k, but it occurs eventually on server k. For each position p ∈ Π and each constant c that occurs eventually on some server, we define the set of servers λ_p(c). If c occurs in the head of some rule in P, we let the message at instant t′. The set λ_p(c) was initialised in line 17 on some server at some instant before t′, and moreover constant c was obtained by matching a fact that occurs on that server at instant t′; thus, c occurs on that server at time instant t′. Consequently, the induction assumption ensures that the inclusion holds at instant t′, which in turn ensures that it holds at instant t, as required.
Lemma A.2. For all servers and , each fact , each constant , and all positions and ′ such that ∈ ( ) and introduces at position on server , an event of type ( , , ′ , ) occurs on server during the run.
Proof. Consider arbitrary , , , , , and ′ as specified in the lemma. By the definition of 'introduces', fact does not occur initially on server , so there exists a time instant such that event ( ) occurs on server . Because of the algorithm order, event ′ ( ) occurs on server at some time instant ′ with ′ < . Let , be the occurrence mapping at instant ′ . Then, we have ∉ , ( ) because is the fact that introduces at position on server , so the algorithm executes lines 32-36 for position . We have the following two possibilities.
• If occurs in the head of a rule in , then all the servers of the cluster are added to the set in line 34, which ensures ( ) ⊆ .
• If does not occur in the head of a rule in , then was matched to a variable in a body atom of some rule in on some server, and also occurs in the rule head. Let , , and be the partial occurrences for the message at instant ′ ; clearly, includes the sets ( ), ( ), and ( ). These sets were copied from the occurrence mappings ′ , ′′ on some server ′ at some time instant when occurs in ′ , so Lemma A.1 ensures ( ) ⊆ ⋃ ′′ ∈Π ′ , ′′ ( ); consequently, ( ) ⊆ ⋃ ′ ∈Π ′ ( ) holds. Set is extended in line 36 with these partial occurrences, which clearly ensures ( ) ⊆ .
Either way, set includes ( ); thus, ∈ ( ) implies ∈ . Lines 38 and 47 ensure that a message for is sent to every server in , so line 42 ensures that an event of type ( , , ′ , ) occurs on server .
Second, we show that the consistency of occurrence mappings is maintained as the computation progresses. We will later use this to ensure that partial answers are sent to all relevant servers that can possibly match an atom. partial occurrence mapping from the message, so clearly ∈ , ( ) holds.
Case 4: Constant is not initially relevant for and server , but occurs initially on server at position . Then, ∈ ( ) holds. Let be the fact that introduces in any position on server . Then, there exist a position ′ and an instant before the current one at which ( , , ′ , ) occurs on server . Let be the partial occurrence mapping for position in the message at instant ′ . Set ( ) was read from some server ′ at some time instant when occurs in ′ , so Lemma A.1 ensures that ( ) includes ( ); thus, line 42 ensures that it is present in the occurrence mappings on server at instant , and so ∈ , ( ) holds.
Third, we show that the occurrence mappings for the subject position are maintained in a way that ensures that all facts with the same subject are stored on the same server. Proof. The proof is by induction on the time instants in the algorithm's run. The base case at time instant zero holds because the occurrence mappings are initially correct. For the induction step, we consider a time instant > 0 such that the claim holds at all instants ′ with ′ < , and we show that the claim holds at time instant as well.
The claim holds vacuously if no occurrence mapping changes at time instant . Thus, we assume that there exist servers and , constant , and fact such that is derived on server and contains , and event ( , , , ) occurs at time instant . Let , be the occurrence mapping on server at instant − 1, and let be the partial occurrence mapping from the message being processed at instant . There exist a server ′ and a time instant such that ( ) is initially set to ′ , ( ) at instant < . Note that, after initialisation, ( ) can later change only in line 32 or 43. We assumed that the subject occurrence mappings of server change at instant , so ( ) ≠ ∅ holds. We next show that ( ) has exactly one element. For the sake of a contradiction, assume that ( ) contains two or more elements. Now consider an arbitrary ′ ∈ ( ) such that ′ ≠ mod . Then, ′ is not added to via lines 21 and 32, so server ′ must be present in some subject occurrence mapping for at time instant zero. Since occurrence mappings are correct at that instant, constant occurs initially in subject position on some server ′ , and moreover ( ) = { ′ }. Since this holds for arbitrary ′ and ( ) is uniquely defined, ( ) can contain at most two elements: one with value ′ ≠ mod , and another one with value ′′ = mod ; note that ( ) = { ′ } holds, so ′′ ∉ ( ); that is, constant does not occur initially on server ′′ in subject position. Now Lemma A.1 ensures that ( ) ⊆ ′ , ( ) holds at instant , and the induction assumption holds for ′ , at instant , so ′ , ( ) = { ′ } holds at instant . Thus, when the destination server for fact is determined in lines 21-22, the partial occurrence for the subject position for contains exactly ′ , so the destination server is selected in line 22 as ′ . In other words, no new server is added to ( ) in line 32 when a message for is processed. If ′′ were added at line 43, then there exists an instant ′′ < and server ′′ such that ′′ ∈ ′′ , ( ) at that instant.
The induction assumption holds for instant ′′ , so the set ⋃ 1≤ ≤ , ( ) at instant ′′ contains at most one element; thus, ′′ = ′ , which contradicts our assumption that ( ) has two or more elements. Now if , ( ) = ∅ holds, then adding ( ) to , ( ) clearly does not violate the inductive claim. Thus, assume that , ( ) ≠ ∅. By the induction assumption, , ( ) contains just one server ′′ . The inductive claim clearly holds if ′ = ′′ . For the sake of a contradiction, assume that ′ ≠ ′′ . Then either ′ or ′′ must be different from mod . In the same way as in the previous paragraph, we can conclude that then occurs initially in subject position in both ′ and ′′ , which contradicts our assumption that occurrences are correct at time instant zero.
Lemma A.4 straightforwardly ensures that (ℎ | ) in line 21 contains at most one server, and therefore all facts with the same subject are stored on the same server. Thus, each fact is stored on precisely one server, which ensures that the algorithm's run contains at most one event of the following types for each fact .
• An event of type ( ) can occur at most once because is derived on a server uniquely identified by its subject and duplicates are eliminated.
• An event of type ( ) can occur at most once because of the observation in the previous item, and moreover each fact in a server is processed once by the function PROCESSFACT.
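The placement discipline established above amounts to a simple default routing rule: in the absence of occurrence-mapping overrides, a fact's destination is determined by its subject modulo the number of servers (the 'mod' selection in lines 21-22), so all facts sharing a subject are colocated and duplicates can be eliminated locally. The following is an illustrative sketch; `home_server` is our name, not the paper's.

```python
import zlib

def home_server(fact, num_servers):
    """Default destination for a fact: hash its subject modulo the number
    of servers, so all facts with the same subject land on one server.
    crc32 is used instead of Python's salted hash() so that placement is
    stable across runs."""
    subject = fact[0]
    return zlib.crc32(subject.encode("utf-8")) % num_servers
```

Because the destination depends only on the subject, the check for whether a derived fact is a duplicate never requires consulting another server.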
For each fact F, we define T(F) as the timestamp assigned to F on the server k whose partition element contains F upon the algorithm's termination; since each fact is stored and assigned a timestamp on just one server, there is precisely one such k for each fact F.
Fourth, we show that the chronological 'happens-before' relationship on the events agrees with the timestamps assigned to the facts involved in the events. ( 2 ) with 1 < 2 . The message from the first event contains a timestamp = ( 1 ). Thus, after the call to SYNCHRONISE in line 12, the value of the local clock is strictly larger than ( 1 ). Given that events of type ( 2 ) do not repeat, fact 2 cannot be added before instant 2 . Thus, when fact 2 is later assigned a timestamp, we have ( 1 ) < ( 2 ).
Consider events 1 ( 1 ) and 2 ( 2 , , , ) such that 1 < 2 and there exists no event 3 ( 2 , , , ) with 3 < 2 . Clearly, the timestamp in the first event has the value of ( 1 ), so, after the call to SYNCHRONISE in line 8, the value of the local clock is strictly larger than ( 1 ). If the set of servers that need to be updated by the message for 2 is empty, then 2 is added to the partition element of server and the timestamp of 2 is set to the current value of in line 44, which ensures ( 1 ) < ( 2 ). Otherwise, server attaches the current value of to the new message produced for 2 in line 47. Before 2 is added to some server ′ , server ′ calls SYNCHRONISE in line 40, which ensures ( 1 ) < ( 2 ), as required.
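The timestamp discipline in the two arguments above is essentially a Lamport logical clock: local derivations advance the clock, and SYNCHRONISE advances it past any timestamp carried by an incoming message, so 'happens-before' implies strictly smaller timestamps. A minimal sketch (class and method names are ours, not the paper's):

```python
class LamportClock:
    """Logical clock: local events increment the clock, and receiving a
    message advances it past the sender's timestamp, mirroring the role
    of SYNCHRONISE in the algorithm."""
    def __init__(self):
        self.time = 0

    def tick(self):
        """Assign a timestamp to a local event (e.g., storing a fact)."""
        self.time += 1
        return self.time

    def synchronise(self, received):
        """Merge a timestamp carried by an incoming message."""
        self.time = max(self.time, received)
        return self.time
```

For example, if server a derives a fact and its timestamp travels in a message to server b, then any fact b derives after receiving the message gets a strictly larger timestamp.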
For the rest of this section, we fix I_1, …, I_ℓ as the datasets computed in each server after the algorithm terminates. We now complete the proof of Theorem 4.1 by considering the soundness, completeness, and nonrepetition properties of the algorithm.

Lemma A.6. It is the case that I_1 ∪ ⋯ ∪ I_ℓ ⊆ P^∞(I).
Proof. The proof is by induction on the construction of sets . The argument is straightforward so we just present a sketch: when ( , , , ℎ) is returned on some server in line 9, substitution satisfies ∈ ; moreover, as matching of progresses, each substitution ′ returned in line 13 satisfies ′ ∈ ′ ; consequently, each substitution in line 23 is an answer to the annotated query . Thus, each such matches all body atoms of the rule corresponding to ( , , , ℎ) in ∞ ( ), and so we clearly have ℎ ∈ ∞ ( ).

Lemma A.7. It is the case that P^∞(I) ⊆ I_1 ∪ ⋯ ∪ I_ℓ.
Proof. Let P^i(I) be the sets from the construction of P^∞(I) as defined in Section 2. The claim follows from the following property: (*) for each i and each fact F ∈ P^i(I), there exists k such that F ∈ I_k.
The proof is by induction on i. The base case holds trivially, so we assume that (*) holds for some i ≥ 0 and show that it also holds for i + 1. To this end, we consider an arbitrary fact F ∈ P^{i+1}(I) ⧵ P^i(I). This fact is derived by a rule r := H ← B_0 ∧ ⋯ ∧ B_n ∈ P and a substitution σ such that Hσ = F and B_jσ ∈ P^i(I) for 0 ≤ j ≤ n. Now choose j as the smallest integer between 0 and n such that T(B_{j′}σ) ≤ T(B_jσ) holds for each 0 ≤ j′ ≤ n. Let B′_0, …, B′_n be the body atoms of the rule rearranged so that B′_0 = B_j is the pivot atom, and the remaining atoms correspond to the annotated query Q = ⋈_1 B′_1 ∧ ⋯ ∧ ⋈_n B′_n returned by MATCHRULES in line 9 on fact B′_0σ. Finally, for each 0 ≤ j ≤ n, let σ_j be the substitution σ restricted to all variables occurring in atoms B′_0, …, B′_j, and let F_j = B′_jσ; moreover, (*) holds for F_j by the induction assumption, so there exists a server k_j such that F_j ∈ I_{k_j} holds. We next prove the following: (◊) for each j with 0 ≤ j ≤ n, a time instant exists when FINISHMATCH is called on server k_j with σ_j, H, and B′_0 for some mapping.
Property (◊) implies our claim because, in lines 18-23, the algorithm then constructs a message for Hσ and dispatches it to some server, so Hσ is later added to that server in line 44, as required for (*).
Assume now that an event of the form (B′_0, σ_j, …, j + 1) occurs on server k_{j+1} at some time instant. Server k_{j+1} then calls EVALUATE at line 13 for ⋈_{j+1} B′_{j+1}. We next show that server k_{j+1} contains F_{j+1} at the time instant when line 13 is executed. We have the following possibilities.
Moreover, if T(F_{j+1}) = T(F_0), since B′_0 = B_j was chosen so that j is the least index of a body atom matched to a fact with timestamp T(F_0), the shape of Q from (12) ensures that ⋈_{j+1} is ≤. Thus, the call to EVALUATE in line 13 on server k_{j+1} returns σ_{j+1}, so the call in line 14 ensures (◊).
To complete the proof, we assume that no event of the form (B′_0, σ_j, …, j + 1) occurs on server k_{j+1} during the algorithm's run (i.e., server k_j never forwards a message to server k_{j+1}) and show that this leads to a contradiction. Under this assumption, there exists a position p ∈ Π such that, for c = F_{j+1}|_p, we have k_{j+1} ∉ λ_p(c) at the time instant when line 26 is executed on server k_j, and so k_{j+1} is removed from the set of destination servers. However, this λ_p(c) is populated in line 17 when, for some index of an atom, constant c is matched

( ) ≤ ( − 1)| | holds for each community at any point in time during an algorithm's run. This, in turn, ensures the following property: thus, (35) holds. In addition, at the end of function ASSIGN-COMMUNITIES, min ≤ | | holds because ∑ = | |. This, in turn, ensures that, in the second phase, each triple ⟨s, p, o⟩ is assigned to server ( (s)). But then, equation (34) clearly ensures | | = for each , which implies our claim.
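The balance argument in the final fragment above can be illustrated by a greedy community-to-server assignment: placing communities largest-first on the currently least-loaded server keeps per-server vertex counts close, after which each triple ⟨s, p, o⟩ is routed to the server of its subject's community. The sketch below is our reconstruction of the idea behind ASSIGN-COMMUNITIES, not the paper's exact algorithm.

```python
def assign_communities(community_sizes, num_servers):
    """Greedy largest-first assignment: each community goes to the
    currently least-loaded server, which keeps the per-server vertex
    counts within one community's size of each other."""
    load = [0] * num_servers
    placement = {}
    for community, size in sorted(community_sizes.items(),
                                  key=lambda kv: -kv[1]):
        k = min(range(num_servers), key=lambda i: load[i])
        placement[community] = k
        load[k] += size
    return placement, load
```

A triple ⟨s, p, o⟩ would then be routed to `placement[community_of[s]]`, so the resulting triple partitions inherit the balance of the community assignment.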