Reformulation-based query answering for RDF graphs with RDFS ontologies



Introduction
RDF is the standard model for sharing data and knowledge bases. The rapid increase in the number and size of RDF graphs makes efficient query answering on RDF quite a challenging task. Reasoning raises a performance challenge: query answering on an RDF graph no longer reduces to evaluating the query on the graph (by finding all the homomorphisms, or embeddings, of the query in the graph). Instead, it must also take into account the ontology (or knowledge) rules, if any, which specify how different classes and properties of an RDF graph relate to each other, and may lead to query answers that evaluation alone cannot compute. Moreover, SPARQL, the standard query language of RDF, allows querying the data and the ontology together. This is a radical departure both from relational databases, and from Description Logics (DL)-style models for RDF data and queries. Regarding reasoning, two main methods have been explored: graph saturation, which injects the ontology knowledge into the graph, and query reformulation, which pushes it into the query. Saturation adds to the graph all the triples it entails through the ontology. Evaluating a query on a saturated graph can be quite efficient; however, saturation takes time to compute, space to store, and needs to be updated when the data and/or ontology rules change. Reformulation leaves the graph unchanged and builds a reformulated query which, evaluated on the original graph, computes all the answers, including those that hold due to entailed triples. Each query reformulation method, thus, targets a certain ontology language and a query dialect. The most expressive RDF fragment for which sound and complete reformulation-based query answering exists is the so-called database fragment [12], in which RDF Schema (RDFS, in short) is used to describe the ontology, while queries only carry over the data triples.
In this work, we present a novel reformulation-based query answering technique under RDFS ontologies for Basic Graph Pattern (BGP) queries over both the data and the ontology. This goes beyond the closest previously known algorithm [12], which is restricted to queries over the data only (not over the ontology). The algorithm we present here also goes beyond those of RDF platforms such as Jena, Virtuoso or Stardog, which we found experimentally to be incomplete when they use reformulation to answer queries over the data and the ontology of an RDF graph. Below, we recall some terminology (Section 2) and discuss the state of the art (Section 3). Then, Section 4 introduces our novel query reformulation algorithm, which we implemented in the platform used in [12,9], leveraging an efficient relational database (RDBMS) engine for query answering. Our experiments (Section 5) demonstrate the practical interest of our reformulation approach.

Preliminaries
We present the basics of the RDF graph data model (Section 2.1), of RDF entailment used to make explicit the implicit information RDF graphs encode (Section 2.2), as well as how they can be queried using the widely-considered SPARQL Basic Graph Pattern queries (Section 2.3).

RDF Graph
We consider three pairwise disjoint sets of values: I of IRIs (resource identifiers), L of literals (constants) and B of blank nodes modeling unknown IRIs or literals, a.k.a. labelled nulls [4]. A well-formed triple belongs to (I ∪ B) × I × (L ∪ I ∪ B), and an RDF graph G is a set of well-formed triples. A triple (s, p, o) states that its subject s has the property p with the object value o [1]. We denote by Val(G) the set of all values (IRIs, blank nodes and literals) occurring in an RDF graph G, and by Bl(G) its set of blank nodes.
Within an RDF graph, triples model either factual assertions for unary relations called classes and binary relations called properties, or RDFS ontological constraints between classes and properties. The RDFS constraints are of four flavours: subclass constraints, subproperty constraints, and typing of the domain (first attribute) or of the range (second attribute) of a property. The triple notations we adopt for RDF assertions and constraints are shown in Table […] (:p_2, :hiredBy, :a), (:a, τ, :PubAdmin)}
The ontology of G_ex, i.e., the first eight triples, states that persons are working for organizations, some of which are public administrations or companies. Further, there exists a special kind of company (modeled by _:b_C). Being hired by or being CEO of an organization are two ways of working for it; in the latter case, this organization is a company. The assertions of G_ex, i.e., the four remaining triples, state that :p_1 is CEO of :c, which is a company of the special kind _:b_C, and :p_2 is hired by the public administration :a.
A homomorphism between RDF graphs allows characterizing whether an RDF graph simply entails another, based on their explicit triples only:
Definition 2 (RDF graph homomorphism). Let G and G' be two RDF graphs. A homomorphism from G to G' is a function ϕ from Val(G) to Val(G'), which is the identity on IRIs and literals, such that for any triple (s, p, o) in G, the triple (ϕ(s), ϕ(p), ϕ(o)) is in G'.
Note that, according to the previous definition, a blank node of G can be mapped to any value of G'. A graph G simply entails a graph G' if there is a homomorphism ϕ from G' to G, which we denote by G ⊨_ϕ G'.
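To make Definition 2 concrete, the following is a minimal brute-force sketch of homomorphism search and simple entailment. It is illustrative only: the string encoding of terms (blank nodes prefixed with "_:") and the function names are assumptions of this sketch, and the enumeration is exponential in the number of blank nodes.

```python
from itertools import product

def is_blank(term):
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def values(graph):
    """Val(G): all terms occurring in the triples of a graph."""
    return {term for triple in graph for term in triple}

def homomorphisms(source, target):
    """Yield each homomorphism from `source` to `target` as a dict on blank nodes.

    IRIs and literals are mapped to themselves; only blank nodes may move.
    """
    blanks = sorted(v for v in values(source) if is_blank(v))
    candidates = sorted(values(target))
    for image in product(candidates, repeat=len(blanks)):
        phi = dict(zip(blanks, image))
        mapped = {tuple(phi.get(t, t) for t in triple) for triple in source}
        if mapped <= set(target):
            yield phi

def simply_entails(g, g_prime):
    """G simply entails G' if some homomorphism maps G' into G."""
    return next(homomorphisms(g_prime, g), None) is not None
```

For instance, a graph stating that :c has type :Comp simply entails the graph stating that some unknown resource has type :Comp, but not the other way around.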

RDF Entailment Rules
The semantics of an RDF graph consists of the explicit triples it contains, and of the implicit triples that can be derived from it using RDF entailment rules.
Definition 3 (RDF entailment rule). An RDF entailment rule r has the form body(r) → head(r), where body(r) and head(r) are RDF graphs, respectively called body and head of the rule r.
For instance, the rule rdfs9 applies to the graph G_ex: G_ex ⊨_ϕ body(rdfs9) through the homomorphism ϕ defined as {s ↦ _:b_C, o ↦ :Comp, s_1 ↦ :c}, hence allows deriving the implicit triple (:c, τ, :Comp). The saturation of an RDF graph allows materializing its semantics, by iteratively augmenting it with the triples it entails using a set R of RDF entailment rules, until reaching a fixpoint; this process is finite [2]. Formally:
Definition 4 (RDF graph saturation). Let G be an RDF graph and R a set of entailment rules. We recursively define a sequence (G^R_i)_{i∈N} of RDF graphs as follows:
- G^R_0 = G, and
- G^R_{i+1} = G^R_i ∪ {ϕ(head(r)) | r ∈ R and G^R_i ⊨_ϕ body(r)} for i ≥ 0.
The saturation of G w.r.t. R, denoted G^R, is G^R_n for the smallest integer n such that G^R_n = G^R_{n+1}.
Definition 5 (BGP query). A BGP query (BGPQ) q is of the form q(x) ← P, where P is a BGP also denoted by body(q) and x ⊆ Var(P) is the set of q's answer variables. The arity of q is that of x, i.e., |x|.
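The fixpoint computation of Definition 4 can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: it hard-codes the six standard W3C rules rdfs2, rdfs3, rdfs5, rdfs7, rdfs9 and rdfs11, omits the four ext rules (whose definitions are not reproduced here), and the prefixed property names and the small test graph are assumptions chosen in the spirit of the running example.

```python
TYPE = "rdf:type"
SC, SP = "rdfs:subClassOf", "rdfs:subPropertyOf"
DOM, RNG = "rdfs:domain", "rdfs:range"

def apply_rules_once(g):
    """Return the triples derivable from g by one rule application and not yet in g."""
    new = set()
    for (s, p, o) in g:
        for (s2, p2, o2) in g:
            if p == SC and p2 == TYPE and o2 == s:   # rdfs9: type propagates to superclass
                new.add((s2, TYPE, o))
            if p == SP and p2 == s:                  # rdfs7: assertion propagates to superproperty
                new.add((s2, o, o2))
            if p == DOM and p2 == s:                 # rdfs2: domain typing of the subject
                new.add((s2, TYPE, o))
            if p == RNG and p2 == s:                 # rdfs3: range typing of the object
                new.add((o2, TYPE, o))
            if p == SC and p2 == SC and o == s2:     # rdfs11: subclass transitivity
                new.add((s, SC, o2))
            if p == SP and p2 == SP and o == s2:     # rdfs5: subproperty transitivity
                new.add((s, SP, o2))
    return new - g

def saturate(graph):
    """Iterate rule application until no new triple is produced (the fixpoint)."""
    g = set(graph)
    while True:
        new = apply_rules_once(g)
        if not new:
            return g
        g |= new
```

On a graph where :ceoOf is a subproperty of :worksFor with range :Comp, and :p1 is CEO of :c, saturation adds that :p1 works for :c and that :c is a company.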
The semantics of a (partially instantiated) BGPQ on an RDF graph is defined through homomorphisms from the query body to the saturation of the queried graph. The homomorphisms needed here are a straightforward extension of RDF graph homomorphisms (Definition 2) to also take variables into account.
Definition 6 ((Non-standard) BGP to RDF graph homomorphism). A homomorphism from a BGP q to an RDF graph G is a function ϕ from Val(body(q)) to Val(G) such that for any triple (s, p, o) ∈ body(q), the triple (ϕ(s), ϕ(p), ϕ(o)) is in G. For a standard homomorphism, as per the SPARQL recommendation, ϕ is the identity on IRIs and literals; for a non-standard one, ϕ is the identity on IRIs, literals and blank nodes.
We distinguish query evaluation, whose result is based only on the explicit triples of the graph, i.e., on BGP to RDF graph homomorphisms, from query answering, which also accounts for the implicit graph triples, i.e., is based on both BGP to RDF graph homomorphisms and RDF entailment. In this paper, we use two flavors of query evaluation and of query answering, which differ in relying either on standard or on non-standard BGP to RDF graph homomorphisms.
Definition 7 ((Non-standard) evaluation and answering). Let q_σ be a partially instantiated BGPQ obtained from a BGPQ q and a substitution σ.
The standard answer set to q_σ on an RDF graph G w.r.t. a set R of RDF entailment rules is: q_σ(G, R) = {ϕ(x_σ) | ϕ is a standard homomorphism from body(q)_σ to G^R}, where x_σ and body(q)_σ denote the result of replacing the variables and blank nodes in x and body(q), respectively, according to σ. If x = ∅, q_σ is a Boolean query, in which case q_σ is false when q_σ(G, R) = ∅ and true when q_σ(G, R) = {⟨⟩}, i.e., the answer to q_σ is an empty tuple. We call q_σ(G, ∅) the standard evaluation of q_σ on G, written q_σ(G) for short, which solely amounts to standard BGP to RDF graph homomorphism finding.
The non-standard answer set, denoted q σ (G, R), and non-standard evaluation q σ (G) of q_σ on G w.r.t. R only differ from the standard ones by using non-standard BGP to RDF graph homomorphisms.
These notions and notations naturally extend to unions of BGPQs.
Example 4. Consider again the BGPQs from the preceding example. Their standard evaluations on G_ex are empty because G_ex has no explicit :worksFor assertion, while their standard answer sets on G_ex w.r.t. R are {⟨:p_1, _:b_C⟩} because :p_1 being CEO of :c, :p_1 implicitly works for it, and :c is explicitly a company of the particular unknown type _:b_C. Consider now the BGPQ q(x) ← (x, :worksFor, y), (y, τ, _:b_C). Under standard query answering, it asks for who is working for some kind of organization and its answer set is {⟨:p_1⟩, ⟨:p_2⟩}; by contrast, under non-standard query answering, it asks for who is working for an organization of the particular unknown type _:b_C in G_ex and its answer set is just {⟨:p_1⟩}.
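The difference between standard and non-standard evaluation in the last query of Example 4 can be reproduced with a small sketch. It assumes a string encoding of terms ("?" prefix for variables, "_:" prefix for blank nodes) and is run on a hand-written excerpt of the saturated example graph; both the encoding and the excerpt are assumptions of this sketch.

```python
TYPE = "rdf:type"

def evaluate(body, answer_vars, graph, standard=True):
    """Evaluate a BGP on the explicit triples of `graph`.

    Under standard evaluation, blank nodes of the query behave like variables;
    under non-standard evaluation, they must match themselves.
    """
    def is_open(term):
        return term.startswith("?") or (standard and term.startswith("_:"))

    solutions = [{}]
    for triple in body:
        extended = []
        for phi in solutions:
            for fact in graph:
                candidate = dict(phi)
                ok = True
                for q_term, g_term in zip(triple, fact):
                    if is_open(q_term):
                        # Bind the term, or check consistency with an earlier binding.
                        if candidate.setdefault(q_term, g_term) != g_term:
                            ok = False
                            break
                    elif q_term != g_term:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        solutions = extended
    return {tuple(phi[v] for v in answer_vars) for phi in solutions}
```

Evaluating q(x) ← (x, :worksFor, y), (y, τ, _:b_C) on the saturated graph then returns both persons under standard evaluation and only :p1 under non-standard evaluation.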

Prior related work
Two main techniques for answering BGPQs on RDF graphs have been investigated in the literature.
Saturation-based query answering. This technique directly follows from the definition of query answers in the W3C's SPARQL recommendations [3], recalled in Section 2.3 for BGPQs. Indeed, it trivially follows from Definition 7 that q(G, R) = q(G^R), and likewise for the non-standard variants, i.e., query answering reduces to query evaluation on the saturated RDF graph. Saturation-based query answering is typically fast, because it only requires query evaluation, which can be efficiently performed by a data management engine. However, saturation takes time to be computed, requires extra space to be stored, and must be recomputed or maintained (e.g., [8,7,12]) upon updates. Many RDF data management systems use saturation-based query answering. They either allow computing graph saturation, e.g., Jena and RDFox, or simply assume that RDF graphs have been saturated before being stored, e.g., DB2RDF.
Reformulation-based query answering. This technique also reduces query answering to query evaluation; however, the reasoning needed to ensure complete answers is performed on the query and not on the RDF graph. A given query q, asked on an RDF graph G w.r.t. R, is reformulated into a query q' such that q(G, R) = q'(G) holds, under standard or non-standard evaluation. Which of the two is needed on the reformulated query depends on the considered RDF fragment: when blank nodes are allowed in RDFS constraints, non-standard evaluation is used [12], while standard evaluation is sufficient otherwise [5,11]. Different SPARQL dialects have been adopted for BGPQ reformulation in more limited settings than the one considered in this paper, i.e., the database fragment of RDF and unrestricted BGPQs. Unions of BGPQs (UBGPQs in short) have been used in [5,11,12]. However, these works are restricted to input BGPQs that must be matched on RDF assertions only. BGPQs that query solely the RDFS ontology, or the ontology and the assertions, are not considered, even though such joint querying is a major novelty of RDF and SPARQL. The techniques adopt unions of BGPQs [5] or of partially instantiated BGPQs [11,12], depending on whether variables can be used in class and property positions in queries, e.g., whether a query triple (x, τ, z) or (x, y, z) is allowed. Reformulation-based query answering in the DL fragment of RDF, which is strictly contained in the database fragment of RDF, has been investigated for relational conjunctive queries [5,10], while the slight extension thereof considered in [6,11,13,17] has been investigated for one-triple BGPQs [13,17], BGPQs [11], and SPARQL queries [6]. In [6], SPARQL queries are reformulated into nested SPARQL, allowing nested regular expressions in property position in query triples. These reformulations allow sound and complete query answering on restricted RDF graphs with RDFS ontologies: these graphs must not contain blank nodes. While such nested reformulations are more compact, the queries we produce are more practical, since their evaluation can be delegated to any off-the-shelf RDBMS, or to an RDF engine such as RDF-3X [16] even if it is unaware of reasoning; further, we do not impose restrictions on RDF graphs. In Section 4, we devise a reformulation-based query answering technique for the entire database fragment of RDF and unrestricted BGPQs.
Reformulation-based query answering is well-suited to frequently updated RDF graphs, because it uses the queried RDF graph at query time (and not its saturation). However, reformulated queries tend to be more complex than the original ones, thus costly to evaluate. To mitigate this, [9] provides an optimized reformulation framework in which an incoming BGPQ is reformulated into a join of unions of BGPQs (JUBGPQ in short). Since this approach is based on a database-style cost model, JUBGPQ reformulations are very efficiently evaluated. Some available RDF data management systems use reformulation-based query answering but return incomplete answer sets in the RDF setting we consider, e.g., AllegroGraph and Stardog miss answers because they cannot evaluate triples with a variable property on the schema, while Virtuoso only exploits subclass and subproperty constraints, but not domain and range ones.
Finally, hybrid approaches have also been studied, e.g., in [18], where some one-triple queries are chosen for materialization and reused during reformulation-based answering.

Extending query reformulation to queries over the ontology
We now present the main contribution of this paper: a reformulation-based query answering (QA) technique able to compute all answers to a BGPQ against all the explicit and implicit triples of an RDF graph, i.e., its RDF assertions and RDFS constraints, as per the SPARQL and RDF recommendations [3,2]. The central idea is to reduce this full QA problem to an assertion-level QA, i.e., where the query is confined to just the explicit and implicit RDF assertions.
To this aim, we divide query reformulation in two steps: the first reformulation step implements the reduction, while the second step relies on the reformulation technique of [12], which considers assertion-level QA.

Overview of our query reformulation technique
Let us first notice that the body of any BGPQ q can be divided into three disjoint subsets of triples (s, p, o), according to the nature of term p: the set b_c of RDFS triples where p is a built-in RDFS property (≺_sc, ≺_sp, ←_d, →_r); the set b_a of assertion triples where p is τ or a user-defined property; and the set b_v where p is a variable. We denote by q_c, q_a and q_v the subqueries respectively associated with these bodies. If b_v is not empty, q can be reformulated as a union of BGPQs, say Q, composed of all BGPQs that can be obtained from q by substituting some (possibly none) variables occurring in q_v with one of the four built-in RDFS properties. We assume this preprocessing step to simplify the explanations, even if in practice it may not be performed. Then, the answers to any BGPQ q' ∈ Q can be computed in two steps:
1. compute the answers to the subquery q'_c, i.e., with body restricted to the RDFS triples; if q'_c has no answer, neither has q'. Otherwise, each answer to q'_c defines a (partial) instantiation σ of the variables in q'.
2. compute the assertion-level answers to each partially instantiated query (q'_{a,v})_σ, where q'_{a,v} is the subquery with body b_a ∪ b_v, and return the union of all the obtained answers.
To summarize, Step 1 computes answers to RDFS triples, which allows one to produce a set of partially instantiated queries that no longer contain RDFS triples. Hence, these queries can then be answered using RDF assertions only, which is the purpose of Step 2.
Our two-step query reformulation follows this decomposition. It furthermore considers a partition of the set R of RDFS entailment rules (recall Table 2) into two subsets: the set of rules R_c that produce RDFS constraints and the set of rules R_a that produce RDF assertions:
- R_c = {rdfs5, rdfs11, ext1, ext2, ext3, ext4};
- R_a = {rdfs2, rdfs3, rdfs7, rdfs9}.
The reason for this decomposition is that query answering remains complete if, on the one hand, only R_c is considered to answer queries made of RDFS triples (Step 1: for any graph G, q_c(G, R) = q_c(G, R_c)), and, on the other hand, only R_a is considered to answer queries on RDF assertions only, as shown in [12]. Query reformulation does not directly work on the entailment rules as classical backward-chaining techniques would do. Instead, a set of so-called reformulation rules is specifically associated with R_c (resp. R_a). We can now outline the two-step query reformulation algorithm:
Step 1. Reformulation w.r.t. R_c: The input BGPQ q is first reformulated into a union Q_c of partially instantiated BGPQs, using the set of reformulation rules associated with R_c (see Figure 1). This reformulation step is sound and complete for query answering w.r.t. R_c, i.e., for any graph G, q(G, R_c) = Q_c(G); furthermore, it preserves the answers with respect to the set R, i.e., q(G, R) = Q_c(G, R).
Step 2. Reformulation w.r.t. R_a: We recall that Q_c consists of queries that do not contain RDFS triples. It is given as input to the query reformulation algorithm of [12], which relies on a set of reformulation rules associated with R_a to output a union Q_{c,a} of partially instantiated BGPQs. This reformulation step being sound and complete for query answering on the RDF assertions of an RDF graph, we obtain the soundness and completeness of the two-step reformulation, i.e., q(G, R) = Q_{c,a}(G).
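The decomposition of a query body and the preprocessing that substitutes property variables can be sketched as follows. The term encoding ("?" prefix for variables, prefixed names for the built-in properties) and the function names are assumptions of this sketch; it enumerates all 5^n combinations for n property variables, which mirrors the definition rather than any optimized procedure.

```python
from itertools import product

TYPE = "rdf:type"
SC, SP = "rdfs:subClassOf", "rdfs:subPropertyOf"
DOM, RNG = "rdfs:domain", "rdfs:range"
BUILTINS = (SC, SP, DOM, RNG)

def split_body(body):
    """Partition a query body into RDFS triples, assertion triples and variable-property triples."""
    b_c = [t for t in body if t[1] in BUILTINS]
    b_v = [t for t in body if t[1].startswith("?")]
    b_a = [t for t in body if t not in b_c and t not in b_v]
    return b_c, b_a, b_v

def expand_property_variables(body):
    """Build the union Q: each property variable is either kept or replaced by a built-in property."""
    prop_vars = sorted({t[1] for t in body if t[1].startswith("?")})
    union = []
    for choice in product((None,) + BUILTINS, repeat=len(prop_vars)):
        sigma = {v: c for v, c in zip(prop_vars, choice) if c is not None}
        # The substitution applies to every occurrence of the variable in the body.
        new_body = [tuple(sigma.get(term, term) for term in t) for t in body]
        union.append((sigma, new_body))
    return union
```

Each member of the union carries its substitution σ, so that answer variables bound by the substitution can be instantiated in the query head as well.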

Reformulation rules associated with R c
We now detail the reformulation rules associated with R_c, see Figure 1. Each reformulation rule is of the form input / output, where the input is composed of a triple from a partially instantiated query q_σ and a triple from O, and the output is a new query obtained from q_σ by instantiating a variable, removing the input triple, or replacing it by one or two triples. The notation old triple/new triple(s) means that old triple is replaced by new triple(s). The specific case where old triple is simply removed is denoted by old triple/−. The notations for the triples themselves are the following:
- a bold character (c, p, s or o in bold) represents an IRI or a blank node;
- a v character represents a variable of the query;
- non-bold s and o characters represent either variables, IRIs or blank nodes, in subject and object positions respectively.
The four rules (1) substitute a variable in a property position by one of the four built-in RDFS properties. All the other rules take as input query triples of the form (s, p, o), where p is a built-in RDFS property. Rule (2) simply removes from q_σ an (instantiated) input triple found in O.
Query triples with a domain (←_d) or range (→_r) property are processed by Rules (3)-(11). Given a triple (p, ←, c) in O (where ← stands for ←_d or →_r), Rule (3) replaces a query triple of the form (v_1, ←, v_2) by two triples (v_1, ≺_sp, p) and (c, ≺_sc, v_2). This rule relies on the fact that a triple (p', ←, c') belongs to the saturation of the RDF graph by R_c if and only if p' is a subproperty of p (including p' = p) and c is a subclass of c' (including c' = c), see Lemma 1 in Section 4.3. However, we do not assume that the ontology ensures the reflexivity of the subclass and subproperty relations, hence Rules (4)-(7), whose sole purpose is to deal with the cases c' = c and p' = p. Should the ontology contain axiomatic triples ensuring the reflexivity of subclass and subproperty, these four rules would be useless. Note that a natural candidate rule to deal with a query triple (p', ←, c') in which both p' and c' are instantiated would have been the following: […] However, such a rule is flawed: it would blindly consider all triples (p, ←, c) from O, which causes a combinatorial explosion. Instead, we propose Rules (10) and (11), which use p' and c' as guides to replace (p', ←, c') by other domain / range triples based on the subproperty-chains from p' and the subclass-chains to c'.
Query triples with a subclass (≺_sc) or subproperty (≺_sp) property are processed by Rules (12)-(16). Rules (12), (13), (14) instantiate a variable using an ontology triple of the form (c_1, ≺, c_2). In Rule (12), which considers a query triple with two variables and instantiates one of these variables, we arbitrarily chose to instantiate the first variable. The last two rules allow going up or down in the class and property hierarchies.
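The characterization that Rule (3) relies on can be checked directly on an ontology, as in the sketch below. It tests whether a domain or range constraint belongs to the saturation by following subproperty chains upwards from the property and subclass chains upwards from the class of an explicit constraint. The function names and the tiny ontology used for illustration are hypothetical.

```python
SC, SP = "rdfs:subClassOf", "rdfs:subPropertyOf"
DOM, RNG = "rdfs:domain", "rdfs:range"

def reachable(start, edges):
    """Nodes reachable from `start` through a non-empty chain of (source, target) edges."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for (a, b) in edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

def entailed_constraint(p2, kind, c2, ontology):
    """True if (p2, kind, c2), with kind in {DOM, RNG}, is explicit or entailed.

    It holds when some explicit (p, kind, c) exists such that p2 is p or a
    subproperty of p, and c2 is c or a superclass of c.
    """
    sp_edges = {(s, o) for (s, p, o) in ontology if p == SP}
    sc_edges = {(s, o) for (s, p, o) in ontology if p == SC}
    supers_of_p2 = reachable(p2, sp_edges) | {p2}
    for (p, k, c) in ontology:
        if k == kind and p in supers_of_p2:
            if c2 in (reachable(c, sc_edges) | {c}):
                return True
    return False
```

The equality cases are handled here by adding the start node explicitly, which plays the role that Rules (4)-(7) play when the ontology does not state reflexivity.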

Reformulation algorithm associated with R c
The reformulation algorithm itself, denoted by Reformulate_c, is presented in Algorithm 1. The set of queries to be explored (named toExplore) initially contains q. Exploring a query consists of generating all new queries that can be obtained from it by applying a reformulation rule (lines 7-9). Newly generated queries are put in the set named produced. The algorithm proceeds in a breadth-first manner, exploring at each step the queries that have been generated at the previous step. When no new query can be generated at a step, the algorithm stops; otherwise the next step explores the newly generated queries (line 11). Note the use of a set named explored, which contains all explored queries; the purpose of this set is to avoid infinite generation of the same queries when the subclass or subproperty hierarchy contains cycles (other than loops); otherwise it is useless. Importantly, not all explored queries are returned in the resulting set, but only those that no longer contain RDFS triples (lines 5-6). Indeed, on the one hand RDFS triples that contain variables are instantiated by the rules in all possible ways using the ontology, and, on the other hand, instantiated triples that belong to the ontology are removed (by Rule (2)). Finally, note that a variable v in a triple of the form (s, v, o) is replaced by a built-in RDFS property in some queries (by Rule (1)) and left unchanged in others as it may also be later mapped to a user-defined property in the RDF graph G.
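The breadth-first exploration described above can be sketched as follows. This is not Algorithm 1 itself: the rule set is passed as a parameter, and the two rules given for illustration are simplified stand-ins written for this sketch (one removes an instantiated triple found in O, in the spirit of Rule (2); the other instantiates the subject variable of a subclass triple from a matching ontology triple). The query representation (a head tuple and a frozenset body) is also an assumption.

```python
SC = "rdfs:subClassOf"
BUILTINS = {"rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}

def substitute(query, var, value):
    """Replace a variable everywhere in the head and body of a query."""
    head, body = query
    sub = lambda t: value if t == var else t
    return (tuple(sub(t) for t in head),
            frozenset(tuple(sub(t) for t in triple) for triple in body))

def rule_remove(query, ontology):
    # Illustrative stand-in: drop an instantiated triple that belongs to O.
    head, body = query
    return {(head, body - {t}) for t in body if t in ontology}

def rule_instantiate_subclass(query, ontology):
    # Illustrative stand-in: bind v in (v, subClassOf, c) using a triple (c1, subClassOf, c) of O.
    out = set()
    for (s, p, o) in query[1]:
        if p == SC and s.startswith("?") and not o.startswith("?"):
            for (c1, p2, c2) in ontology:
                if p2 == SC and c2 == o:
                    out.add(substitute(query, s, c1))
    return out

def reformulate_c(query, ontology, rules):
    """Breadth-first exploration; returns only the queries without RDFS triples."""
    result, explored, to_explore = set(), set(), {query}
    while to_explore:
        produced = set()
        for q in to_explore:
            if not any(t[1] in BUILTINS for t in q[1]):
                result.add(q)
            for rule in rules:
                produced |= rule(q, ontology)
        explored |= to_explore
        to_explore = produced - explored   # the explored set guards against cycles
    return result
```

With these two stand-in rules, only direct subclass triples are resolved; handling chains requires the hierarchy-traversal rules of Figure 1.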
A simple analysis of the reformulation rule behavior shows that the worst-case time complexity of algorithm Reformulate_c is polynomial in the size of O and […]
The correctness of the algorithm relies on the following lemma, which characterizes the saturated graph G^{R_c} from the triples of G. We call ≺_sc-chain (resp. ≺_sp-chain) from s to o a possibly empty sequence of triples (s_i, ≺_sc, o_i) (resp. (s_i, ≺_sp, o_i)) with 1 ≤ i ≤ n, such that s_1 = s, o_n = o and, for i > 1, s_i = o_{i−1}. Since we do not enforce the reflexivity of the subclass relation, a triple (c, ≺_sc, c) belongs to G^R if and only if there is a non-empty ≺_sc-chain from c to c (which includes the case (c, ≺_sc, c) ∈ G). The same holds for the subproperty relation.
Below, we assume without loss of generality that the input query does not contain blank nodes; if needed, these have been equivalently replaced by variables. Therefore, all blank nodes that occur in the output reformulation have been introduced by the reformulation rules, and specifically refer to unknown classes and properties they identify within the ontology at hand. This justifies the use of non-standard query evaluation and answering in the next theorems.
Theorem 1. Let G be an RDF graph with ontology O and q be a BGP query without blank nodes. Let Q_c be the output of Reformulate_c(q, O). Then, under non-standard evaluation and answering:
q(G, R_c) = Q_c(G) (18)
q(G, R) = Q_c(G, R_a) (19)
Example 5. Consider the BGPQ asking for how someone is related to some particular kind of company: q(x, y) ← (x, y, z), (z, τ, t), (y, ≺_sp, :worksFor), (t, ≺_sc, :Comp). Its answer set on G_ex w.r.t. R, which can be easily checked using (G_ex)^R provided in Section 2, is: q(G_ex, R) = {⟨:p_1, :ceoOf⟩}.
The output of Reformulate_c(q, RDFS(G_ex)) is: Q_c = {q'(x, :ceoOf) ← (x, :ceoOf, z), (z, τ, _:b_C); q''(x, :hiredBy) ← (x, :hiredBy, z), (z, τ, _:b_C)}, where q' and q'' are obtained by binding, using Rule (13), y to either :ceoOf or :hiredBy, and t to _:b_C. Further, these bindings have also produced the fully instantiated RDFS constraints (:ceoOf, ≺_sp, :worksFor) and (_:b_C, ≺_sc, :Comp) in q', as well as (:hiredBy, ≺_sp, :worksFor) and (_:b_C, ≺_sc, :Comp) in q'', which have then been eliminated by Rule (2).
The non-standard answering of Q_c on G_ex w.r.t. R, i.e., q'(G_ex, R) ∪ q''(G_ex, R), provides the correct answer set {⟨:p_1, :ceoOf⟩}, whose only tuple results from q'. Note that, using standard answering, the incorrect answer ⟨:p_2, :hiredBy⟩ would have also been obtained from q'', since under this semantics q'' asks for who is hired by an organization of some type (this is the case of :p_2, who is hired by a public administration) and not who is hired by an organization of the particular unknown type of company designated by _:b_C in G_ex.
We now rely on the query reformulation algorithm from [12], say Reformulate_a, which takes as input a partially instantiated BGPQ q without RDFS triples and a graph G and, using a set of reformulation rules associated with R_a, outputs a reformulation Q_a such that: q(G, R_a) = Q_a(G). The adaptation of Reformulate_a to an input UBGPQ instead of a BGPQ is straightforward. Furthermore, we notice that the algorithm would consider potential blank nodes in the input query as if they were IRIs. Hence, denoting by Q_{c,a} the output of Reformulate_a(Q_c, O), we obtain:
Q_c(G, R_a) = Q_{c,a}(G) (20)
Putting together (20) and statement (19) in Theorem 1, we can prove the correctness of the global reformulation algorithm:
Theorem 2. Let G be an RDF graph and q be a BGPQ without blank nodes. Let Q_{c,a} be the reformulation of q by the 2-step algorithm described in Section 4.1. Then: q(G, R) = Q_{c,a}(G).

Experimental evaluation
We have implemented our reformulation algorithm on top of OntoSQL (https://ontosql.inria.fr), a Java platform providing efficient RDF storage, saturation, and query evaluation on top of an RDBMS [9,12]; we used Postgres v9.6.
To save space, OntoSQL encodes IRIs and literals into integers, using a dictionary table that allows going from one to the other. It stores all resources of a certain type in a one-attribute table, all (subject, object) pairs for each data property in a table, and all schema triples in another table; the tables are indexed. Our server has a 2.7 GHz Intel Core i7 and 160 GB of RAM; it runs CentOS Linux 7.5. We generated LUBM^∃ data graphs [15] of 10M triples and restricted the ontology to RDFS, leading to 175 triples (123 ≺_sc, 5 ≺_sp, 25 ←_d and 22 →_r). We devised 14 queries having from 3 to 7 triples; one has no result, while the others have from a few dozen to three hundred thousand results. Each has 1 or 2 triples which match the ontology (and must be evaluated on it for correctness), including (but not limited to) the generic triple (x, y, z), which appears 7 times overall in our workload. Some of our queries are not handled through reformulation by AllegroGraph and Stardog, nor by Virtuoso (recall Section 3).
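The storage layout described above can be pictured with a small in-memory sketch. It only mirrors the description (a dictionary encoding values as integers, one single-attribute table per class, one two-attribute table per data property, and one table for schema triples); the class name, attribute names and the use of Python sets instead of indexed relational tables are assumptions of this sketch, not OntoSQL's actual schema.

```python
from collections import defaultdict

TYPE = "rdf:type"
SCHEMA_PROPERTIES = {"rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}

class EncodedStore:
    """Hypothetical in-memory analogue of a dictionary-encoded triple layout."""

    def __init__(self):
        self.code_of = {}                          # dictionary: value -> integer
        self.value_of = []                         # dictionary: integer -> value
        self.class_tables = defaultdict(set)       # class code -> {subject code}
        self.property_tables = defaultdict(set)    # property code -> {(subject, object)}
        self.schema_table = set()                  # encoded schema triples

    def encode(self, value):
        if value not in self.code_of:
            self.code_of[value] = len(self.value_of)
            self.value_of.append(value)
        return self.code_of[value]

    def decode(self, code):
        return self.value_of[code]

    def add(self, s, p, o):
        s_id, p_id, o_id = self.encode(s), self.encode(p), self.encode(o)
        if p in SCHEMA_PROPERTIES:
            self.schema_table.add((s_id, p_id, o_id))
        elif p == TYPE:
            self.class_tables[o_id].add(s_id)
        else:
            self.property_tables[p_id].add((s_id, o_id))
```

A reformulated query then only touches the class and property tables named by its triples, which is what makes its translation to SQL joins and unions direct.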
Figure 2 shows for each query: the size of the UBGPQ reformulation (in parentheses after the query name on the x axis), i.e., the number of BGPQs it contains; the reformulation time (with both R_c and R_a); the time to translate the reformulation into SQL; the time to evaluate this SQL query; the total query answering time through reformulation, and (for comparison) through saturation. Note the logarithmic y axis. Details of our experiments are available online. The reformulation time is very short (0.2 ms to 55 ms). Unsurprisingly, the time to convert the reformulation into SQL is closely correlated with the reformulation size. The overhead of our approach is quite negligible, given that the answering time through reformulation is very close to the SQL evaluation time.
As expected, saturation-based query answering is faster; however, saturating this graph took more than 1289 seconds, while the slowest query (Q9) took 46 seconds. As in [12], we compute for each query Q a threshold n_Q, which is the smallest number of times we need to run Q so that saturating G and running Q n_Q times on G^R is faster than n_Q runs of Q through reformulation; intuitively, after n_Q runs of Q, the saturation cost amortizes. For our queries, n_Q ranged from 29 (Q9) to 9648 (Q5), which shows that saturation costs take a while to amortize. If the graph or the ontology changes, requiring maintenance of the saturated graph, reformulation may be even more competitive.
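The threshold n_Q follows from a simple inequality: saturation amortizes as soon as T_sat + n · t_sat < n · t_ref, where T_sat is the saturation time, t_sat the time to evaluate Q on the saturated graph and t_ref the time to answer Q through reformulation. A minimal sketch, with hypothetical parameter names and made-up timings in the usage note:

```python
import math

def amortization_threshold(saturation_time, time_on_saturated, time_by_reformulation):
    """Smallest n such that saturating once and running the query n times on the
    saturated graph is strictly faster than n runs through reformulation.

    Returns None when reformulation is never slower per run, in which case the
    saturation cost never amortizes.
    """
    gain_per_run = time_by_reformulation - time_on_saturated
    if gain_per_run <= 0:
        return None
    # Smallest integer n with n * gain_per_run > saturation_time.
    return math.floor(saturation_time / gain_per_run) + 1
```

For example, with an illustrative saturation time of 1000 s, 2 s per run on the saturated graph and 12 s per run through reformulation, the function returns 101.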

Conclusion
We have presented a novel reformulation-based query answering technique for RDF graphs with RDFS ontologies. Its novelty lies in its capacity to handle query triples over both the assertions and the ontology; such queries are not always handled correctly by existing RDF engines. In the future, we plan to integrate our reformulation technique into the cost-based optimized reformulation framework we introduced in [9] to improve its performance, and to extend it to an OBDA setting along the lines of [14].
Acknowledgements: This work is supported by the Inria Project Lab grant iCoda, a collaborative project between Inria and several major French media.

Appendix
This appendix provides the figure of the running example graph and proofs of the results claimed in the paper.

Proof of Proposition 1
Proof. We provide an upper bound on the number of reformulations explored during the reformulation of a BGPQ by analyzing the producer-consumer dependencies among rules w.r.t. the form of the query. Given a query form Q and an ontology O, we denote by #explored(Q, O) the number of explored reformulations during the execution of Reformulate_c(q, O) for a BGPQ q of form Q.
First, we notice that for the most general query form Q(x) ← t_1, t_2, ..., t_n (where the t_i are triples), it holds that: […] where Q_i has the form Q_i(x_i) ← t_i, with x_i the list of variables in t_i. We now analyze the reformulations obtained for the different forms of queries composed of a single triple.
Let us consider the query form Q_0(x) ← (s, v, o), where s and o are values or variables and v is a variable. The rules in (1) are the only rules that can consume Q_0. They produce queries of the form […] (and there are two similar cases with properties →_r and ≺_sp). Since no rule feeds rule (1), it holds that: […] Queries of the form Q_2 only feed rules from (12) to (16) […]

Proof of Theorem 1
For the sake of readability, we assume in the following that G does not contain blank nodes, so that we do not need non-standard query evaluation. This assumption can be made without loss of generality. Indeed, we may define a one-to-one mapping f from the blank nodes of G to fresh IRIs, apply f to G before any processing, and apply the inverse mapping f^{−1} to the answer tuples obtained considering f(G) to get answers considering G.
With the above assumption, to prove statement (18), it remains to prove soundness and completeness.
(soundness) We want to prove that for every reformulation q_σ of q in Q_c and every tuple t answer to q_σ in G, there is a graph G', obtained from G by application of some entailment rules, such that t is an answer to q in G'. In other words, we want to prove that q_σ(G, ∅) ⊆ q(G, R_c). Since q_σ(G, ∅) ⊆ q_σ(G, R_c), it is sufficient to prove that q_σ(G, R_c) ⊆ q(G, R_c).
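The one-to-one mapping f from blank nodes to fresh IRIs, and its inverse applied to answer tuples, can be sketched as follows. The "_:" prefix convention for blank nodes and the "urn:skolem:" prefix for fresh IRIs are assumptions of this sketch; freshness only holds if no IRI of the graph already uses that prefix.

```python
def skolemize(graph, prefix="urn:skolem:"):
    """Apply f: replace each blank node of the graph by a fresh IRI.

    Returns the renamed graph and the inverse mapping f^-1 (fresh IRI -> blank node).
    Assumes `prefix` does not occur in any IRI of the graph.
    """
    f = {}

    def rename(term):
        if term.startswith("_:"):
            return f.setdefault(term, prefix + term[2:])
        return term

    renamed = {tuple(rename(t) for t in triple) for triple in graph}
    inverse = {iri: blank for blank, iri in f.items()}
    return renamed, inverse

def unskolemize(answers, inverse):
    """Apply f^-1 to every value of every answer tuple."""
    return {tuple(inverse.get(v, v) for v in row) for row in answers}
```

Answers computed on the renamed graph are mapped back with `unskolemize`, so the rest of the processing never needs to distinguish blank nodes from IRIs.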
The proof is by induction on the length l of a sequence of reformulation rule applications leading to qσ, starting from O and q.

Base step. For l = 0, we have qσ = q, so qσ(G, Rc) ⊆ q(G, Rc).

Inductive step. Suppose that qσ(G, Rc) ⊆ q(G, Rc) holds for every l < α. At l = α, q'σ' has been produced from qσ by the application of a reformulation rule (i), where qσ is a reformulation of q. Since the sequence leading to qσ has length < α, we get qσ(G, Rc) ⊆ q(G, Rc) by the induction hypothesis. We will show that q'σ'(G, Rc) ⊆ qσ(G, Rc). There are basically two cases:
- The reformulation rule (i) instantiates a variable of qσ to generate q'σ', i.e., rule (i) is one of (1), (4)-(7), (12)-(14). In this case, q'σ' is contained in qσ, so q'σ'(G, Rc) ⊆ qσ(G, Rc).
- The reformulation rule (i) has the form "t1 ∈ qσ, t2 ∈ O / qσ[t1/t3]", i.e., it replaces a triple of qσ by another one (or by two, for rule (3)). Observe that σ' = σ holds here. If ϕ(xσ) ∈ q'σ'(G, Rc), then ϕ(t3σ) ∈ G^Rc. Furthermore, the reformulation rules ensure that ϕ(t3σ), t2 ⊨Rc ϕ(t1σ). As a result, ϕ(t1σ) ∈ G^Rc, and ϕ is a total assignment of the variables of qσ such that ϕ( […]

(completeness) We now show that q(G, Rc) ⊆ Qc(G), with Qc(G) = ⋃ qσ(G, ∅) for qσ ranging over Reformulate_c(q, O), i.e., for each answer tuple a ∈ q(G, Rc), there exists qσ ∈ Qc, a reformulation of q using O, such that a ∈ qσ(G, ∅). In the following, we will consider that Qc contains queries in which all the instantiated RDFS triples that belong to the ontology are kept; in other words, the triples removed by applications of rule (2) are restored in the resulting queries. This has no impact on the completeness of the algorithm, since the reformulations output in both versions have the same answers in G.

Let the query q be defined by q(x) ← t_1, t_2, ..., t_n, with the t_i being the body triples of q. An answer from q(G, Rc) has the form ϕ(x), where ϕ is a homomorphism from body(q) to G^Rc. If for every triple t_i of the body of q, ϕ(t_i) is not an RDFS triple, then ϕ(body(q)) ⊆ G (because data triples are not entailed by Rc), so a valid reformulation of q is q itself, since q(G, Rc) = q(G, ∅). Otherwise, there exists a triple t_i in the body of q such that ϕ(t_i) ∈ O^Rc, and we will show that there exists qσ, a reformulation of q where only t_i has been replaced by a BGP P such that P ⊆ O and ϕ(x) = ϕ(xσ).

First case: ϕ(t_i) = (c, ≺sc, c') ∈ G^Rc. According to Lemma 1, there is a ≺sc-chain C = ((c_i, ≺sc, c_{i+1})), 1≤i<c, in G such that c = c_1 and c' = c_c. The triple t_i can have one of the following forms:
- (c, ≺sc, c'): we consider qσ obtained from q by applying rule (15) for each triple of C; finally (c, ≺sc, c') is replaced by […]
- (c, ≺sc, v'): we consider qσ obtained from q by applying rule (15) for each triple of C, then (14); finally (c, ≺sc, v') is replaced by […]
- […]: we consider qσ obtained from q by applying rule (1), then (15) for each triple of C, then (14) […]
- […]: we consider qσ obtained from q by applying rule (16) for each triple of C in inverse order, then (13); […]
- […]: we consider qσ obtained from q by applying rule (1), then (16) for each triple of C in inverse order, then (13); finally […]
- (v, ≺sc, v'): we consider qσ obtained from q by applying rule (12), then (15) for each triple of C, then (14); finally (v, ≺sc, v') is replaced by (c_{c−1}, ≺sc, c') ∈ O. Since σ = {v' → c', v → c}, ϕ(x) = ϕ(xσ).
- (v, vp, v'): we consider qσ obtained from q by applying rule (1), then (12), then (15) for each triple of C, then (14); finally (v, vp, v') is replaced by (c_{c−1}, ≺sc, c') ∈ O.
Since σ = {vp → ≺sc, v' → c', v → c}, ϕ(x) = ϕ(xσ).

Second case: ϕ(t_i) = (p, ←d, c) ∈ G^Rc. According to Lemma 1, there are three cases, depending on whether a chain is empty or not. We describe the case where none of the chains is empty, hence assuming that there exist a ≺sp-chain P = ((p_i, ≺sp, p_{i+1})), 1≤i≤p, in G from p to p', a triple (p', ←d, c') ∈ G, and a ≺sc-chain C = ((c_i, ≺sc, c_{i+1})), 1≤i≤c, in G from c' to c. The other cases are handled similarly, using also rules (4) and (5). The triple t_i can have the following forms:
- (p, ←d, c): we consider qσ obtained from q by applying rule (10) for each triple in C in inverse order, then (11) for each triple in P; finally (p, ←d, c) is replaced by (p', ←d, c') ∈ O. Since σ = ∅, ϕ(x) = ϕ(xσ).
- (p, ←d, v'): we consider qσ obtained from q by applying rule (11) for each triple in P, then (9), then (15) for each triple in C, then (14); finally (p, ←d, c) is replaced by (c_{c−1}, ≺sc, c) ∈ O. Since σ = {v' → c}, ϕ(x) = ϕ(xσ).
- (p, vp, v'): we consider qσ obtained from q by applying rule (1), then (11) for each triple in P, then (9), then (15) for each triple in C, then (14) […]
- (v, ←d, c): we consider qσ obtained from q by applying rule (10) for each triple in C in inverse order, then (8), then (16) for each triple in P in inverse order, then (13); finally (v, ←d, c) is replaced by (p, ≺sp, p_2) ∈ O. Since σ = {v → p}, ϕ(x) = ϕ(xσ).
- (v, vp, c): we consider qσ obtained from q by applying rule (1), then (10) for each triple in C in inverse order, then (8), then (16) for each triple in P in inverse order, then (13); finally (v, vp, c) is replaced by (p, ≺sp, p_2) ∈ O.
Since σ = {vp → ←d, v → p}, ϕ(x) = ϕ(xσ).
- (v, ←d, v'): we consider qσ obtained from q by applying rule (3), then (16) for each triple in P in inverse order, then (13), and then, on the other triple, (15) for each triple in C, then (14); finally (v, ←d, v') is replaced by (p, ≺sp, p_2), (c_{c−1}, ≺sc, c) ∈ O. Since σ = {v → p, v' → c}, ϕ(x) = ϕ(xσ).
Hence, for each triple t_i in q such that ϕ(t_i) ∈ O^Rc, there is qσ, a reformulation of q, where only t_i has been replaced by a BGP P such that P ⊆ O and ϕ(x) = ϕ(xσ). It follows that there is qσ, a reformulation of q, in which all body triples of q mapped by ϕ to O^Rc have been replaced by triples that belong to O, such that ϕ(x) = ϕ(xσ). Since the other triples of q are necessarily mapped by ϕ to G (actually, G \ O), we conclude that ϕ(x) = ϕ(xσ) is an answer to qσ in G.

This concludes the proof of statement (18), which is the only part of Theorem 1 needed in the proof of Theorem 2. Statement (19) actually follows from the next lemma (Lemma 2).

Proof of Theorem 2
Lemma 2. For every RDF graph G, it holds that (G^Ra)^Rc = G^(Ra∪Rc).

Proof. For one direction, (G^Ra)^Rc ⊆ G^(Ra∪Rc), the proof is trivial.

For the converse direction, G^(Ra∪Rc) ⊆ (G^Ra)^Rc, we take a triple t ∈ G^(Ra∪Rc) and distinguish two cases:
- either t is not an RDFS triple; then, by applying Theorem 1 of [12], t ∈ G^Ra ⊆ (G^Ra)^Rc. In other words, assertion rules suffice to entail all RDF assertions;
- or t is an RDFS triple. Since the RDFS ontology O of G does not contain an RDFS property as subject or object, the entailment rule rdfs7 does not entail RDFS triples. So, t ∈ O or t has been produced by a rule in Rc. Moreover, all rules in Rc have a body that contains only RDFS triples, so t ∈ O or t has been entailed from O using rules in Rc, i.e., t ∈ O^Rc. We also know that O^Rc ⊆ (G^Ra)^Rc, so t ∈ (G^Ra)^Rc.

In both cases, we have proven that t ∈ (G^Ra)^Rc.
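The two-stage saturation of Lemma 2 can be checked on a small graph. The following Python sketch is only an illustration under stated assumptions: the string encodings `type`, `sc`, `sp` and `domain` stand for τ, ≺sc, ≺sp and ←d; the rules assumed for Ra are rdfs2, rdfs7 and rdfs9 in their standard RDFS form, and those assumed for Rc are rdfs5, rdfs11, ext1 and ext3 (range rules are omitted); the example graph is hypothetical.

```python
TYPE, SC, SP, DOM = "type", "sc", "sp", "domain"  # assumed encodings of τ, ≺sc, ≺sp, ←d

def rc_step(t1, t2):
    """Assumed constraint rules: rdfs11, rdfs5, ext1, ext3 (range rules omitted)."""
    (s, p, o), (s2, p2, o2) = t1, t2
    if o == s2:
        if p == p2 and p in (SC, SP):   # rdfs11 / rdfs5: transitivity
            yield (s, p, o2)
        if p == DOM and p2 == SC:       # ext1: domain propagates to superclasses
            yield (s, DOM, o2)
        if p == SP and p2 == DOM:       # ext3: subproperties inherit the domain
            yield (s, DOM, o2)

def ra_step(t1, t2):
    """Assumed assertion rules: rdfs9, rdfs2, rdfs7."""
    (s, p, o), (s2, p2, o2) = t1, t2
    if p == SC and p2 == TYPE and o2 == s:  # rdfs9
        yield (s2, TYPE, o)
    if p == DOM and p2 == s:                # rdfs2
        yield (s2, TYPE, o)
    if p == SP and p2 == s:                 # rdfs7
        yield (s2, o, o2)

def saturate(g, *steps):
    """Naive fixpoint: apply every rule to every pair of triples until stable."""
    g = set(g)
    while True:
        new = {t for a in g for b in g for step in steps for t in step(a, b)} - g
        if not new:
            return g
        g |= new

# Hypothetical example graph.
G = {("a", "p2", "b"), ("p2", SP, "p1"), ("p1", DOM, "C1"),
     ("C1", SC, "C2"), ("C2", SC, "C3")}

two_stage = saturate(saturate(G, ra_step), rc_step)  # (G^Ra)^Rc
one_stage = saturate(G, ra_step, rc_step)            # G^(Ra ∪ Rc)
```

On this graph the two saturations coincide, as the lemma states; the naive pairwise fixpoint is chosen for readability, not efficiency.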
Proof (of the theorem).

Fig. 1. Reformulation rules for a partially instantiated query qσ w.r.t. an RDFS ontology O. For compactness, we factorize similar rules, using the symbol ← to denote either ←d or →r, and ≺ to denote either ≺sc or ≺sp.

Algorithm 1: Reformulate_c
Input: BGPQ q and ontology O
Output: the reformulation of q with the rules from Fig. 1
result ← ∅; toExplore ← {q}; explored ← ∅
while toExplore ≠ ∅ do
    produced ← ∅
    for each qσ ∈ toExplore do
        if qσ does not contain any RDFS triple then
            result ← result ∪ {qσ}
        for each RDFS triple t in qσ do
            for each q'σ' obtained by applying a reformulation rule to t do
                produced ← produced ∪ {q'σ'}
        explored ← explored ∪ {qσ}
    toExplore ← produced \ explored
return result

Proposition 1. […] simply exponential in the size of q. More precisely: the algorithm Reformulate_c runs in time O(|Val(O)|^(6|q|)), where |q| is the number of triples in the body of q.
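The exploration loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: a query is assumed to be a frozenset of triples, and `is_rdfs_triple` and `apply_rules` are hypothetical callbacks standing in for the RDFS-triple test and for the reformulation rules of Fig. 1, which are not reproduced here. The toy rule set used in the demonstration (`toy_rules`) is purely illustrative and is not one of the paper's rules.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Query = FrozenSet[Triple]  # assumption: a query is identified with its body

def reformulate_c(
    q: Query,
    is_rdfs_triple: Callable[[Triple], bool],
    apply_rules: Callable[[Query, Triple], Iterable[Query]],
) -> Set[Query]:
    """Worklist loop of Algorithm 1; the rules are supplied by the caller."""
    result: Set[Query] = set()
    to_explore: Set[Query] = {q}
    explored: Set[Query] = set()
    while to_explore:
        produced: Set[Query] = set()
        for q_sigma in to_explore:
            rdfs_triples = [t for t in q_sigma if is_rdfs_triple(t)]
            if not rdfs_triples:
                result.add(q_sigma)  # no RDFS triple left: keep the query
            for t in rdfs_triples:
                produced.update(apply_rules(q_sigma, t))
            explored.add(q_sigma)
        to_explore = produced - explored  # only explore unseen queries
    return result

# --- Toy demonstration (hypothetical ontology and rules) ---
O = {("A", "sc", "B")}
RDFS_PROPS = {"sc", "sp", "domain", "range"}

def is_rdfs(t: Triple) -> bool:
    return t[1] in RDFS_PROPS

def substitute(q: Query, var: str, val: str) -> Query:
    """Replace every occurrence of variable var by val in the query."""
    return frozenset(tuple(val if x == var else x for x in t) for t in q)

def toy_rules(q: Query, t: Triple) -> Iterable[Query]:
    if t in O:                      # toy rule: drop a schema triple found in O
        yield q - {t}
    if t[2].startswith("?"):        # toy rule: bind a variable object using O
        for (s, p, o) in O:
            if (s, p) == (t[0], t[1]):
                yield substitute(q, t[2], o)

demo_query = frozenset({("?y", "type", "?x"), ("A", "sc", "?x")})
demo_result = reformulate_c(demo_query, is_rdfs, toy_rules)
```

With the toy rules, the variable `?x` is first bound to `B`, then the instantiated schema triple is dropped, leaving a single query over data triples only. The `explored` set is what guarantees termination when different rule sequences produce the same query.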

Lemma 1. Let G be an RDF graph. It holds that:
- (c, ≺sc, c') ∈ G^Rc iff G contains a non-empty ≺sc-chain from c to c';
- (p, ≺sp, p') ∈ G^Rc iff G contains a non-empty ≺sp-chain from p to p';
- (p', ←d, c') ∈ G^Rc iff G contains a triple (p, ←d, c), a (possibly empty) ≺sp-chain from p' to p and a (possibly empty) ≺sc-chain from c to c'.
The case for (p', →r, c') ∈ G^Rc is similar (replace ←d by →r in the statement above).
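As an illustration, the characterization of Lemma 1 can be compared with a direct saturation on a small acyclic graph. The Python sketch below rests on assumptions: `sc`, `sp` and `domain` are hypothetical string encodings of ≺sc, ≺sp and ←d; the rules rdfs5, rdfs11, ext1 and ext3 are encoded as they are described in the proof of Lemma 1 (transitivity, domain propagation along ≺sc, and domain inheritance along ≺sp); the range rules are omitted.

```python
def saturate_rc(g):
    """Fixpoint of rdfs5, rdfs11, ext1 and ext3 (range rules omitted)."""
    g = set(g)
    while True:
        new = set()
        for (s, p, o) in g:
            for (s2, p2, o2) in g:
                if o != s2:
                    continue
                if p == p2 and p in ("sc", "sp"):   # rdfs11 / rdfs5
                    new.add((s, p, o2))
                if p == "domain" and p2 == "sc":    # ext1
                    new.add((s, "domain", o2))
                if p == "sp" and p2 == "domain":    # ext3
                    new.add((s, "domain", o2))
        if new <= g:
            return g
        g |= new

def reachable(g, prop, start):
    """Nodes reachable from start through a (possibly empty) prop-chain in g."""
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for (s, p, o) in g:
            if s == n and p == prop and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen

def nonempty_chain(g, prop, a, b):
    """True iff g contains a non-empty prop-chain from a to b."""
    return any(b in reachable(g, prop, o)
               for (s, p, o) in g if s == a and p == prop)

def domain_by_chains(g, p1, c1):
    """Third item of Lemma 1: some (p, domain, c) in g, a sp-chain from p1
    to p and a sc-chain from c to c1 (both possibly empty)."""
    return any(p in reachable(g, "sp", p1) and c1 in reachable(g, "sc", c)
               for (p, prop, c) in g if prop == "domain")

# Hypothetical example graph.
G = {("p2", "sp", "p1"), ("p1", "domain", "C1"),
     ("C1", "sc", "C2"), ("C2", "sc", "C3")}
SAT = saturate_rc(G)
NODES = {x for (s, _, o) in G for x in (s, o)}
```

For every pair of nodes of this graph, membership of the corresponding triple in `SAT` agrees with `nonempty_chain` (for ≺sc and ≺sp) and with `domain_by_chains` (for ←d).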

Fig. 3. Illustration of Example 1, where black edges represent assertion triples and blue edges represent RDFS triples. Plain edges are those contained in Gex, and dotted edges are those added by its saturation G^R_ex.

Table 2. RDFS entailment rules.

The standard RDF entailment rules are defined in [2]. In this work, we consider the rules shown in Table 2, which we call RDFS entailment rules; all values except the τ, ≺sc, ≺sp, ←d, →r properties are blank nodes. These rules are the most frequently used for RDFS entailment; they produce implicit triples by exploiting the RDFS ontological constraints of an RDF graph. For example, the rule rdfs9, which propagates values from subclasses to their superclasses, is defined by body(rdfs9) = {(s, ≺sc, o), (s1, τ, s)} and head(rdfs9) = {(s1, τ, o)}.

The direct entailment of an RDF graph G with a set of RDF entailment rules R, denoted by C_{G,R}, characterizes the set of implicit triples resulting from rule applications that use solely the explicit triples of G. It is defined as: […]

The saturation of Gex w.r.t. the set R of RDFS entailment rules shown in Table 2 is attained after the following two saturation steps: […]

[…] These rules always produce queries of the form Q2(x2) ← (s, ≺sc, o), where s and o are either values from Val(O) or variables. Moreover, there are at most 2 variables in Var(Q2), which can only be instantiated by values from Val(O) or variables. Concerning a query of the form Q1, either rule (3) can be applied, and then the produced queries have the form Q3(x3) ← (v1, ≺sp, p), (c, ≺sc, v2), or a rule from (4) to (11) can be applied. In the latter case, we observe that all further produced queries will have the form Q4(x4) ← (s, p, o) with p ∈ {←d, ≺sc, ≺sp}, s and o belonging to Val(O) ∪ Var(Q4), and there is at most one variable among s and o.

Proof of Lemma 1

The only entailment rule in Rc that allows one to infer a new triple with property ≺sc (respectively ≺sp) is the rule rdfs11 (resp. rdfs5). Since this rule states the transitivity of the property ≺sc (resp. ≺sp), it holds that (c, ≺sc, c') ∈ G^Rc iff G contains a non-empty ≺sc-chain from c to c' (resp. (p, ≺sp, p') ∈ G^Rc iff G contains a non-empty ≺sp-chain from p to p').

Assume now that (p', ←d, c') ∈ G^Rc. The only entailment rules in Rc that entail a triple with property ←d are ext1 and ext3. The bodies of these rules contain a triple with property ←d, so there exists an entailment chain (of triples with property ←d) of length l ≥ 0 starting from G and using only rules ext1 and ext3. We prove by induction on l that G contains a triple (p, ←d, c), a (possibly empty) ≺sp-chain from p' to p and a (possibly empty) ≺sc-chain from c to c'.
- If l = 0, then (p', ←d, c') ∈ G, and there are an empty ≺sp-chain from p' to p' and an empty ≺sc-chain from c' to c'.
- Otherwise (l > 0), the last rule applied in the chain is:
  • either ext1, so G^Rc contains a triple (p', ←d, c1), which results from an entailment chain of length l−1 starting from G and using only rules ext1 and ext3, and a triple (c1, ≺sc, c'). By the induction hypothesis, G contains a triple (p, ←d, c), a (possibly empty) ≺sp-chain from p' to p and a (possibly empty) ≺sc-chain from c to c1. Moreover, by the first point of the lemma (proved above), (c1, ≺sc, c') ∈ G^Rc implies that G contains a non-empty ≺sc-chain from c1 to c'. So, concatenating the two ≺sc-chains, we obtain a ≺sc-chain from c to c'. Hence, G contains a triple (p, ←d, c), a (possibly empty) ≺sp-chain from p' to p and a ≺sc-chain from c to c';
  • or ext3, and the proof is similar to that for ext1, replacing ≺sc-chains by ≺sp-chains.

We have proven that (p', ←d, c') ∈ G^Rc implies that G contains a triple (p, ←d, c), a (possibly empty) ≺sp-chain from p' to p and a (possibly empty) ≺sc-chain from c to c'. The converse implication is straightforward: from the first two points of the lemma, we obtain (c, ≺sc, c') ∈ G^Rc and (p', ≺sp, p) ∈ G^Rc (for the non-empty chains); then, by one application of each of the entailment rules ext1 and ext3, we obtain (p', ←d, c') ∈ G^Rc.