Answering Pattern Queries Using Views

Answering queries using views has proven effective for querying relational and semistructured data. This paper investigates this issue for graph pattern queries based on graph simulation. We propose a notion of pattern containment to characterize graph pattern matching using graph pattern views. We show that a pattern query can be answered using a set of views if and only if it is contained in the views. Based on this characterization, we develop efficient algorithms to answer graph pattern queries. We also study problems for determining (minimal, minimum) containment of pattern queries. We establish their complexity (from cubic-time to NP-complete) and provide efficient checking algorithms (approximation when the problem is intractable). In addition, when a pattern query is not contained in the views, we study maximally contained rewriting to find approximate answers; we show that it is in cubic-time to compute such rewriting, and present a rewriting algorithm. We experimentally verify that these methods are able to efficiently answer pattern queries on large real-world graphs.


INTRODUCTION
Answering queries using views has been extensively studied for relational data [27], [33], XML [30], [50], [51] and semistructured data [11], [43], [52]. Given a query $Q$ and a set $\mathcal{V} = \{V_1, \ldots, V_n\}$ of views, the idea is to find another query $A$ such that $A$ is equivalent to $Q$, and $A$ only refers to views in $\mathcal{V}$ [27]. If such a query $A$ exists, then given a database $D$, one can compute the answer $Q(D)$ to $Q$ in $D$ by using $A$, which uses only the data in the materialized views $V_i(D)$, without accessing $D$. This is particularly effective when $D$ is "big" and/or distributed. Indeed, views have been advocated for scale independence, to query big data "independent of" the size of the underlying data [8], [17]. They are also useful in data integration [33], data warehousing, semantic caching [14], and access control [16].
The need for studying this problem is even more evident for answering graph pattern queries (a.k.a. graph pattern matching) [18], [28]. Graph pattern queries have been increasingly used in social network analysis [10], [18], among other things. Real-life social graphs are typically large, and are often distributed. For example, Facebook has more than 1.26 billion users with 140 billion links [46], and the data is geo-distributed to various data centers [26]. One of the major challenges for social network analysis is how to cope with the sheer size of real-life social graphs when evaluating graph pattern queries. Graph pattern matching using views provides an effective method to query such big data.
Example 1: A fraction of a recommendation network is depicted as a graph G in Fig. 1 (a), where each node denotes a person with name and job title (e.g., project manager (PM), database administrator (DBA), programmer (PRG), business analyst (BA) and software tester (ST)); and each edge indicates collaboration/recommendation relation, e.g., (Bob, Dan) indicates that Dan worked well with Bob, on a project led by Bob.To build a team, one issues a pattern query Q s depicted in Fig. 1 (c), to find a group of PM, DBA and PRG.It requires that (1) DBA 1 and PRG 2 worked well under the project manager PM; and (2) each PRG (resp.DBA) had been supervised by a DBA (resp.PRG), represented as a collaboration cycle [31] in Q s .For pattern matching based on graph simulation [18], [47], the answer Q s (G) to Q s in G can be denoted as a set of pairs (e, S e ) such that for each pattern edge e in Q s , S e is a set of edges (a match set) for e in G.For example, pattern edge (PM, PRG 2 ) has a match set S e = {(Bob, Dan), (Walt, Bill)}, in which each edge satisfies the node labels and connectivity constraint of the pattern edge.
It is known that it takes $O(|Q_s|^2 + |Q_s||G| + |G|^2)$ time to compute $Q_s(G)$ [18], [28], where $|G|$ (resp. $|Q_s|$) is the size of $G$ (resp. $Q_s$). For example, to identify the match set of each pattern edge $(\text{DBA}_i, \text{PRG}_i)$ (for $i \in [1,2]$), each pair of (DBA, PRG) in $G$ has to be checked, and moreover, a number of join operations have to be performed to eliminate invalid matches. This is a daunting cost when $G$ is big. One can do better by leveraging a set of views. Suppose that a set of views $\mathcal{V} = \{V_1, V_2\}$ is defined, materialized and cached ($\mathcal{V}(G) = \{V_1(G), V_2(G)\}$), as shown in Fig. 1 (b). As will be shown later, to compute $Q_s(G)$, (1) we only need to visit views in $\mathcal{V}(G)$, without accessing the original big graph $G$; and (2) $Q_s(G)$ can be efficiently computed by "merging" views in $\mathcal{V}(G)$. Indeed, the views $\mathcal{V}(G)$ already contain partial answers to $Q_s$ in $G$: for each pattern edge $e$ in $Q_s$, the matches of $e$ (e.g., $(\text{DBA}_1, \text{PRG}_1)$) are contained either in $V_1(G)$ or $V_2(G)$ (e.g., the matches of $e_3$ in $V_2$). These partial answers can be used to construct the complete match $Q_s(G)$. Hence, the cost of computing $Q_s(G)$ is quadratic in $|Q_s|$ and $|\mathcal{V}(G)|$, where $|\mathcal{V}(G)|$ is much smaller than $|G|$. □
This example suggests that we conduct graph pattern matching by capitalizing on available views. To do this, several questions have to be settled. (1) How to decide whether a pattern query $Q_s$ can be answered by a set $\mathcal{V}$ of views? (2) If so, how to efficiently compute $Q_s(G)$ from $\mathcal{V}(G)$? (3) If not, how to find approximate answers to $Q_s(G)$ by using $\mathcal{V}(G)$? (4) In both cases, which views in $\mathcal{V}$ should we choose to (approximately) answer $Q_s$?
Contributions. This paper investigates these questions for answering graph pattern queries using graph pattern views. We focus on graph pattern matching defined in terms of graph simulation [28], since it is commonly used in social community detection [10], biological analysis [35], and mobile network analyses [24]. While conventional subgraph isomorphism often fails to capture meaningful matches, graph simulation fits into emerging applications with its "many-to-many" matching semantics [10], [18], [28]. Moreover, it is more challenging since graph simulation is "recursively defined" and has poor data locality [15].
(1) To characterize when graph pattern queries can be answered using views based on graph simulation, we propose a notion of pattern containment (Section 3). It extends the traditional notion of query containment [6] to deal with a set of views. Given a pattern query $Q_s$ and a set $\mathcal{V}$ of view definitions, we show that $Q_s$ can be answered using $\mathcal{V}$ if and only if $Q_s$ is contained in $\mathcal{V}$.
Based on the characterization, we provide an evaluation algorithm for answering graph pattern queries using views (Section 3). Given $Q_s$ and a set $\mathcal{V}(G)$ of views on a graph $G$, the algorithm computes $Q_s(G)$ in $O(|Q_s||\mathcal{V}(G)| + |\mathcal{V}(G)|^2)$ time, without accessing $G$ at all when $Q_s$ is contained in $\mathcal{V}$. It is far less costly than $O(|Q_s|^2 + |Q_s||G| + |G|^2)$ for evaluating $Q_s$ directly on $G$ [18], [28], since $G$ is typically much larger than $\mathcal{V}(G)$ in practice.
(2) To decide which views in $\mathcal{V}$ to use when answering $Q_s$, we identify three fundamental problems for pattern containment (Section 4). Given $Q_s$ and $\mathcal{V}$, (a) the containment problem is to decide whether $Q_s$ is contained in $\mathcal{V}$, (b) minimal containment is to identify a subset of $\mathcal{V}$ that minimally contains $Q_s$, and (c) minimum containment is to find a minimum subset of $\mathcal{V}$ that contains $Q_s$. We show that the first two problems are in cubic time, whereas the third one is NP-complete and hard to approximate (APX-hard).
The results are also useful for query minimization. Indeed, when $\mathcal{V}$ contains a single view, the containment problem becomes the classical query containment problem [6].
These results are a nice surprise. Recall that even for relational SPC (a.k.a. conjunctive) queries, the problem of query containment is NP-complete [6]; for XPath fragments, it is EXPTIME-complete or even undecidable [41]. In contrast, (minimal) containment for graph pattern queries is in low PTIME, although the queries may be "recursively defined" (as cyclic patterns).
(3) We develop efficient algorithms for checking (minimal, minimum) pattern containment (Section 5). For containment and minimal containment, we provide cubic-time algorithms in the sizes of query $Q_s$ and view definitions $\mathcal{V}$, which are much smaller than graph $G$ in practice. For minimum containment, we provide an efficient approximation algorithm with performance guarantees.
(4) When exact answers of a query $Q_s$ cannot be computed using views $\mathcal{V}$, i.e., when $Q_s$ is not contained in $\mathcal{V}$, one wants to find the maximal part of $Q_s$ that can be answered using $\mathcal{V}$. We study the problem of maximally contained rewriting (Section 6). A query $Q'_s$ is a maximally contained rewriting of $Q_s$ if (a) it is a subquery of $Q_s$, (b) it is contained in $\mathcal{V}$, and (c) $Q'_s$ is not a subquery of any larger contained rewriting of $Q_s$. We show that a maximally contained rewriting $Q'_s$ of $Q_s$ w.r.t. $\mathcal{V}$ can be found in cubic time, by presenting such an algorithm. This provides us with a query-driven approximation scheme, by treating $Q'_s(G)$ as approximate query answers to $Q_s$ in a big graph $G$. Alternatively, one can compute exact answers $Q_s(G)$ by using $Q'_s(G)$ and additionally accessing a small fraction of $G$, along the same lines as the scale independence approach suggested in [17].
(5) Using real-life graphs (Amazon, YouTube, Citation and WebGraph), we experimentally verify the effectiveness, efficiency and accuracy of our view-based matching method (Section 7). We find that this method is 23.2 times faster than conventional methods for pattern queries on WebGraph [3], a Web graph with 118.1 million nodes (web pages) and 1.02 billion edges (hyperlinks). In addition, our matching algorithm scales well with data size and pattern size; and our algorithms for (minimal, minimum) pattern containment tests take 0.15 second on complex (cyclic) patterns. We further find that our algorithm can compute a maximally contained rewriting $Q'_s$ efficiently, and that the query results of $Q'_s$ on $\mathcal{V}(G)$ have an accuracy of 0.73 (F-measure) on average on WebGraph.
The work is a first step toward understanding graph pattern matching using views, from theory to practical methods. We contend that the method is effective: one may pick and cache previous query results, and efficiently answer pattern queries using the views, without accessing large social graphs directly. If a query $Q_s$ is not contained in a set of views, one can either adjust the views or approximately answer $Q_s$ by making use of a maximally contained rewriting of $Q_s$. Better still, incremental methods are already in place to efficiently maintain cached pattern views (e.g., [20]). The view-based method can be readily combined with existing distributed, compression and incremental techniques, and yield a promising approach to querying "big" social data.
Related Work.This work extends [21] by including new proofs, results and experimental study: (1) proofs for the pattern containment characterization (Section 3); (2) proofs of the fundamental problems for pattern containment (Section 4); (3) algorithms contain and minimum (Section 5); (4) results and proofs for maximally contained rewriting for graph pattern matching (Section 6), a topic not studied in [21]; and (5) two sets of new experiments (Section 7): one for evaluating the effectiveness of our approach using graphs with billions of nodes and edges [3], and the other for the efficiency and accuracy of approximate query answering by means of maximally contained rewriting.
We categorize other related work as follows.
Query answering and rewriting. There are two view-based approaches for query processing: query rewriting and query answering [27], [33]. Given a query $Q$ and a set $\mathcal{V}$ of views, (1) query rewriting is to reformulate $Q$ into an equivalent query $Q'$ in a fixed language such that for all $D$, $Q(D) = Q'(D)$, and moreover, $Q'$ refers only to $\mathcal{V}$; and (2) query answering is to compute $Q(D)$ by evaluating a query $A$ equivalent to $Q$, while $A$ refers only to $\mathcal{V}$ and its extensions $\mathcal{V}(D)$. While the former requires that $Q'$ is in a fixed language, the latter imposes no constraint on $A$.
We next review previous work on these issues for relational databases, XML data and general graphs.
(1) Relational data.Query processing using views has been extensively studied for relational data (see [6], [27], [33] for surveys).It is known that for SPC (conjunctive) queries, query answering and rewriting using views are intractable [27], [33].For the containment problem, the well-known homomorphism theorem shows that an SPC query is contained in another if and only if there exists a homomorphism between the tableaux representing the queries, and it is NP-complete to determine the existence of such a homomorphism [6].Moreover, the containment problem is undecidable for relational algebra [6].
(2) XML.There has also been a host of work on processing XML queries using views [39], [41], [44].In [39], the containment of simple XPath queries is shown coNP-complete.When disjunction, DTDs and variables are taken into account, the problem ranges from coNP-complete to EXPTIME-complete to undecidable for various XPath classes [41].In [7], containment and query rewriting of XML queries are studied under constraints expressed as a structural summary.For tree pattern queries (a fragment of XPath), [30] and [50] have studied maximally contained rewriting.
(3) Semistructured data. Views defined in Lorel are studied in, e.g., [52], which are quite different from graph patterns considered here. View-based query rewriting for regular path queries (RPQs) is shown PSPACE-complete in [11], and an EXPTIME rewriting algorithm is given in [43]. The containment problem is shown undecidable for RPQs under constraints [25] and for extended conjunctive RPQs [9].
(4) RDF.An EXPTIME query rewriting algorithm is given in [32] for SPARQL.It is shown in [13] that query containment is in EXPTIME for PSPARQL, which supports regular expressions.There has also been work on evaluating SPARQL queries on RDF based on cached query results [14].
Our work differs from the prior work in the following.(1) We study query answering using views for graph pattern queries via graph simulation, which are quite different from previous settings, from complexity bounds to processing techniques.(2) We show that the containment problem for the pattern queries is in PTIME, in contrast to its intractable counterparts for e.g., SPC, XPath, RPQs and SPARQL.(3) We study a more general form of query containment between a query Q s and a set of queries, to identify an equivalent query for Q s that is not necessarily a pattern query.(4) The high complexity of previous methods for query answering using views hinders their applications in the real world.In contrast, our algorithms have performance guarantees and yield a practical method for querying real-life social networks.

Pattern queries on big graphs.
There have been a host of techniques for graph pattern queries via simulation on "big" and/or distributed graphs.We next review some of them.
(1) Distributed graph simulation [22], [23], [37].Several algorithms are in place for distributed graph simulation, by following the synchronized message passing strategy [23] of Pregel [38], scheduling message passing across different fragments, and by integrating (incremental) partial evaluation, partitioned parallelism and message passing [22]; performance guarantees on data shipment and response time are provided in [22].
(2) Graph compression.To query "big" graphs, query-preserving compression [19] and graph summarization [40] have been proposed to reduce the search space by converting a big G to a smaller graph G c , and evaluate queries on G c without decompression [19].
(3) Incremental view maintenance.As real-life graphs are updated frequently, techniques for incremental graph simulation have been developed [20] with complexity measured in the size of changes to the input and output, independent of the size of the original big graphs.These allow us to efficiently maintain graph pattern views.
(4) Bounded evaluation.A class of access constraints, a characterization and algorithms have been developed in [12], which allow us to decide whether a pattern query Q can be answered by accessing a small fraction G Q of a big graph G under the access constraints, and if so, to compute Q(G) by accessing G Q only.The methods work for both graph simulation and subgraph isomorphism.
This work can be naturally combined with distributed, compression and incremental techniques.For example, view-based techniques can be employed for local evaluation of graph simulation in the distributed algorithm of [22]; views can be cached for simulation-preserving compressed graphs of [19] instead of the original graphs G, which are only 43% of the size of G on average; and the incremental techniques of [20] can be used to efficiently maintain views when graphs are updated.Moreover, maximally contained views can be combined with access constraints [12] to compute exact query answers following [17].Taken together, these methods yield a promising approach to querying "big" graphs.
The techniques of this work can be readily extended to various revisions of graph simulation such as bounded simulation (reported in [21]), dual simulation and strong simulation [36] (see discussion in Section 8).Due to the space constraints, we only report our findings about graph simulation in this paper.

GRAPHS, PATTERNS AND VIEWS
We first review pattern queries and graph simulation.We then state the problem of pattern matching using views.

Data Graphs and Graph Pattern Queries
Data graphs. A data graph is a directed graph $G = (V, E, L)$, where (1) $V$ is a finite set of nodes; (2) $E \subseteq V \times V$, in which $(v, v')$ denotes an edge from node $v$ to $v'$; and (3) $L$ is a function such that for each node $v$ in $V$, $L(v)$ is a set of labels from an alphabet $\Sigma$. Intuitively, $L$ specifies the attributes of a node, e.g., name, keywords, blogs and social roles [31].
Pattern queries [18]. A graph pattern query, denoted as $Q_s$, is a directed graph $Q_s = (V_p, E_p, f_v)$, where (1) $V_p$ and $E_p$ are the set of pattern nodes and the set of pattern edges, respectively; and (2) $f_v$ is a function defined on $V_p$ such that for each node $u \in V_p$, $f_v(u)$ is a label in $\Sigma$. We remark that $f_v$ can be readily extended to specify search conditions in terms of Boolean predicates [18] (see Figure 8 for examples of search conditions).
Graph pattern matching. We say that a data graph $G = (V, E, L)$ matches a graph pattern query $Q_s = (V_p, E_p, f_v)$ via simulation, denoted by $Q_s \trianglelefteq G$, if there exists a binary relation $S \subseteq V_p \times V$, referred to as a match in $G$ for $Q_s$, such that
• for each pattern node $u \in V_p$, there exists a node $v \in V$ such that $(u, v) \in S$, referred to as a match of $u$; and
• for each pair $(u, v) \in S$, (a) $f_v(u) \in L(v)$; and moreover, (b) for each pattern edge $e = (u, u')$ in $E_p$, there exists an edge $(v, v')$ in $E$, referred to as a match of $e$ in $S$, such that $(u', v') \in S$, i.e., $v'$ is a match of $u'$.
When $Q_s \trianglelefteq G$, there exists a unique maximum match $S_o$ in $G$ for $Q_s$ [28]. We derive $\{(e, S_e) \mid e \in E_p\}$ from $S_o$, where $S_e$ is the set of all matches of $e$ in $S_o$, referred to as the match set of $e$. Here $S_e$ is nonempty for all $e \in E_p$. We define the result of $Q_s$ in $G$, denoted as $Q_s(G)$, to be the unique maximum set $\{(e, S_e) \mid e \in E_p\}$ if $Q_s \trianglelefteq G$, and let $Q_s(G) = \emptyset$ otherwise.
We define the size of query $Q_s$, denoted by $|Q_s|$, to be the total number of nodes and edges in $Q_s$; we define the size $|Q_s(G)|$ of $Q_s(G)$ to be the total number of edges in the sets $S_e$ for all edges $e$ in $Q_s$.
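The definition above can be illustrated with a small Python sketch. It is only a naive fixpoint computation of the maximum match and the derived match sets, not the algorithm of [18], [28], and it does not attain their time bound. The function name `simulation_result` and the dictionary-based encoding of graphs and patterns are assumptions made for illustration.

```python
def simulation_result(pattern_edges, pattern_label, graph_edges, graph_labels):
    """Return {pattern edge: match set}, or {} if the pattern has no match.

    pattern_label: {pattern node: required label}   (assumed encoding)
    graph_labels:  {graph node: set of labels}      (assumed encoding)
    """
    # Successor sets of the data graph.
    succ = {v: set() for v in graph_labels}
    for v, w in graph_edges:
        succ[v].add(w)

    # Start from all label-compatible pairs: sim[u] = candidate matches of u.
    sim = {u: {v for v, labels in graph_labels.items() if lab in labels}
           for u, lab in pattern_label.items()}

    # Remove candidates until every remaining pair satisfies condition (b).
    changed = True
    while changed:
        changed = False
        for u, u2 in pattern_edges:
            # v stays a candidate for u only if some successor of v matches u2.
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True

    # Every pattern node needs at least one match.
    if any(not matches for matches in sim.values()):
        return {}

    # Derive the match set S_e of each pattern edge from the maximum match.
    return {(u, u2): {(v, w) for v in sim[u] for w in succ[v] & sim[u2]}
            for (u, u2) in pattern_edges}
```

Each pass re-examines all pattern edges, so the sketch favors readability over efficiency; it is meant only to make the "many-to-many" semantics concrete.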
Example 2: Consider pattern query $Q_s$ shown in Fig. 1 (c), where each node carries a search condition (job title), and each edge indicates a collaboration relationship. When $Q_s$ is posed on the network $G$ of Fig. 1 [...]
To simplify the discussion, we consider w.l.o.g. graph patterns $Q_s$ that are connected, as commonly found in real life. That is, for any nodes $u$ and $u'$ in $Q_s$, there is an undirected path between $u$ and $u'$, by treating $Q_s$ as an undirected graph. One can easily verify the following, by a straightforward induction on the number of edges in $Q_s$, based on the definition of graph simulation.
Lemma 1: For any connected pattern $Q_s$ and any graph $G$, if $Q_s(G) = \emptyset$, then $S_e = \emptyset$ for all $e$ in $Q_s$. □

Graph Pattern Matching Using Views
We next formulate the problem of graph pattern matching using views. We study views $V$ defined as a graph pattern query, and refer to the query result $V(G)$ in a data graph $G$ as the view extension for $V$ in $G$ or simply as a view [27]. Given a pattern query $Q_s$ and a set $\mathcal{V} = \{V_1, \ldots, V_n\}$ of view definitions, graph pattern matching using views is to find another query $A$ such that for all data graphs $G$, (1) $A$ only refers to views $\mathcal{V}$ and their extensions $\mathcal{V}(G) = \{V_1(G), \ldots, V_n(G)\}$, and (2) $A$ is equivalent to $Q_s$ (i.e., $A(G) = Q_s(G)$). If such a query $A$ exists, we say that $Q_s$ can be answered using views $\mathcal{V}$.
In contrast to query rewriting using views [27], here $A$ is not required to be a pattern query [33]. For example, Fig. 1 (b) depicts the view definitions $\mathcal{V} = \{V_1, V_2\}$ and their extensions $\mathcal{V}(G)$. To answer the query $Q_s$ (Fig. 1 (c)), we want to find a query $A$ that computes $Q_s(G)$ by using only $\mathcal{V}$ and $\mathcal{V}(G)$, where $A$ is not necessarily a graph pattern.
For a set V of view definitions, we define the size |V| of V to be the total size of V i 's in V, and the cardinality card(V) of V to be the number of view definitions in V.
The notations of the paper are summarized in Table 1.
TABLE 1: A summary of notations
Remark. Our techniques also work on graphs and queries with edge labels. Indeed, an edge-labeled graph can be converted to a node-labeled graph: for each edge $e$, add a "dummy" node carrying the label of $e$, along with two unlabeled edges.
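The conversion from an edge-labeled graph to a node-labeled graph mentioned in the remark can be sketched as follows. This is a minimal illustration; the function name `to_node_labeled`, the triple encoding of labeled edges and the naming scheme for dummy nodes are all assumptions.

```python
def to_node_labeled(nodes, labeled_edges):
    """Convert an edge-labeled graph into a node-labeled one.

    nodes:         {node: set of labels}
    labeled_edges: iterable of (source, edge_label, target) triples
    Returns (new_nodes, new_edges), where new_edges are unlabeled pairs.
    """
    new_nodes = {v: set(labels) for v, labels in nodes.items()}
    new_edges = set()
    for i, (src, label, dst) in enumerate(labeled_edges):
        dummy = ("edge", i)          # hypothetical identifier for the dummy node
        new_nodes[dummy] = {label}   # the dummy node carries the edge label
        new_edges.add((src, dummy))  # first unlabeled edge
        new_edges.add((dummy, dst))  # second unlabeled edge
    return new_nodes, new_edges
```

The same conversion would be applied to pattern queries and view definitions, so that all three agree on how edge labels are represented.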

A CHARACTERIZATION
In this section, we propose a characterization of graph pattern matching using views, i.e., a sufficient and necessary condition for deciding whether a pattern query can be answered by using a set of views. We also provide a quadratic-time algorithm for answering pattern queries using views.
Pattern containment. We first introduce a notion of pattern containment, by extending the traditional notion of query containment to a set of views. Consider a pattern query $Q_s = (V_p, E_p, f_v)$ and a set $\mathcal{V} = \{V_1, \ldots, V_n\}$ of view definitions, where $E_i$ denotes the edge set of $V_i$. We say that $Q_s$ is contained in $\mathcal{V}$, denoted by $Q_s \sqsubseteq \mathcal{V}$, if there exists a mapping $\lambda$ from $E_p$ to the powerset $\mathcal{P}(\bigcup_{i \in [1,n]} E_i)$, such that for all data graphs $G$, the match set $S_e \subseteq \bigcup_{e' \in \lambda(e)} S_{e'}$ for all edges $e \in E_p$.
The analysis involves query Q s and view definitions V, independent of data graphs G and view extensions V(G).
Example 3: Recall $G$, $\mathcal{V}$ and $Q_s$ given in Fig. 1. Then $Q_s \sqsubseteq \mathcal{V}$. Indeed, there exists a mapping $\lambda$ from $E_p$ of $Q_s$ to sets of edges in $\mathcal{V}$, which maps (a) edges $(\text{PM}, \text{DBA}_1)$ and [...] of $Q_s$ to $e_3$, and (c) $(\text{PRG}_1, \text{DBA}_2)$ and $(\text{PRG}_2, \text{DBA}_1)$ to $e_4$ in $V_2$. In any graph $G$, one may verify that for any edge $e$ of $Q_s$, its matches are contained in the union of the match sets of the edges in $\lambda(e)$. For instance, the match set of pattern edge [...]
Pattern containment and query answering. The main result of this section is as follows: (1) pattern containment characterizes pattern matching using views; and (2) when $Q_s \sqsubseteq \mathcal{V}$, for all graphs $G$, $Q_s(G)$ can be efficiently computed by using views $\mathcal{V}(G)$ only, independent of the size $|G|$ of the underlying graph $G$. In Sections 4 and 5 we will show how to decide whether $Q_s \sqsubseteq \mathcal{V}$ by inspecting $Q_s$ and $\mathcal{V}$ only, also independent of $|G|$.
Theorem 2: (1) A pattern query $Q_s$ can be answered using $\mathcal{V}$ if and only if $Q_s \sqsubseteq \mathcal{V}$. (2) For any graph $G$, $Q_s(G)$ can be computed in $O(|Q_s||\mathcal{V}(G)| + |\mathcal{V}(G)|^2)$ time if $Q_s \sqsubseteq \mathcal{V}$. □
This suggests an approach to answering graph pattern queries, as follows. Given a pattern $Q_s$ and a set $\mathcal{V}$ of views, we first efficiently determine whether $Q_s \sqsubseteq \mathcal{V}$ (by using the algorithm to be given in Section 5); if so, for all (possibly big) graphs $G$ we compute $Q_s(G)$ by using $\mathcal{V}(G)$ instead of $G$, in quadratic time in the size of $\mathcal{V}(G)$, which is much smaller than $G$.
Below we prove Theorem 2.
(I) We first prove the Only If condition, i.e., if $Q_s$ can be answered using $\mathcal{V}$, then $Q_s \sqsubseteq \mathcal{V}$. We show this by contradiction. Assume that $Q_s$ can be answered using $\mathcal{V}$, while $Q_s \not\sqsubseteq \mathcal{V}$. By the definition of containment, there must exist some data graph $G_o$ such that for all possible mappings $\lambda$, there always exists at least one edge $e$ in $Q_s$ such that $S_e \not\subseteq \bigcup_{e' \in \lambda(e)} S_{e'}$. Consider the following two cases. (1) When $Q_s(G_o) = \emptyset$. By Lemma 1, for all $e$ in $Q_s$, $S_e = \emptyset$ in $G_o$, which contradicts the assumption that $S_e \not\subseteq \bigcup_{e' \in \lambda(e)} S_{e'}$. (2) When $Q_s(G_o) \neq \emptyset$. If so, there must exist at least one edge $e_o$ in $G_o$ such that $e_o$ is in $S_e$ for some edge $e$ in $Q_s$, but it is not in $S_{e'}$ for any $e' \in \lambda(e)$. That is, $e_o$ cannot be included in $S_{e'}$ for any $e' \in \lambda(e)$, for all possible $\lambda$. This contradicts the assumption that $Q_s$ can be answered using only $\mathcal{V}$ and $\mathcal{V}(G_o)$, since at least the edge $e_o$ is missing from $\mathcal{V}(G_o)$ for some graph $G_o$, no matter how $\lambda$ is defined. Therefore, $Q_s$ can be answered using $\mathcal{V}$ only if $Q_s \sqsubseteq \mathcal{V}$.
(II) We next show the If condition of Theorem 2(1) with a constructive proof: we give an algorithm to evaluate $Q_s$ using $\mathcal{V}(G)$ when $Q_s \sqsubseteq \mathcal{V}$.
Algorithm. We next present the algorithm that evaluates $Q_s$ using $\mathcal{V}$. The algorithm, denoted as MatchJoin, is shown in Fig. 2. It takes as input (1) a pattern query $Q_s$ and a set of view definitions $\mathcal{V}$, (2) the view extensions $\mathcal{V}(G)$, and (3) a mapping $\lambda$ for $Q_s \sqsubseteq \mathcal{V}$. In a nutshell, it computes $Q_s(G)$ by "merging" (joining) views $V_i(G)$ as guided by $\lambda$. The merge process iteratively identifies and removes those edges that are not matches of $Q_s$, until a fixpoint is reached and $Q_s(G)$ is correctly computed.
More specifically, MatchJoin works as follows. It starts with empty match sets $S_e$ for each pattern edge $e$ (lines 1-2). MatchJoin sets $S_e$ as $\bigcup_{e' \in \lambda(e)} S_{e'}$, where $S_{e'}$ is extracted from $\mathcal{V}(G)$ (lines 3-4), following the definition of $\lambda(e)$. It then performs a fixpoint computation to remove all invalid matches from $S_e$ (lines 5-10). For each pattern edge $e_p = (u, u')$ with its match set $S_{e_p}$ changed, it checks whether the change propagates to the "parents" of $u$ (i.e., nodes $u''$ with an edge $(u'', u)$). That is, it checks whether each match $e' = (v'', v)$ of $e = (u'', u)$ still remains a match of $e$ (lines 7-10), following the definition of simulation (Section 2.1). More specifically, it checks whether a child $u_1$ of $u''$ (resp. a child $u_2$ of $u$) has no match as a child of $v''$ (resp. $v$). If so, $e'$ is no longer a match of $e$, since $v''$ (resp. $v$) is an invalid match of $u''$ (resp. $u$), and $e'$ is removed from $S_e$ (lines 8, 10). In the process, if $S_e$ becomes empty for some edge $e$, MatchJoin returns $\emptyset$ since $Q_s$ has no match in $G$. Otherwise, the process (lines 5-11) proceeds until $Q_s(G)$ is computed and returned (line 12).
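The merge-and-refine process just described can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it rechecks all matches in every round instead of propagating changes through an index, so it does not attain the quadratic bound of Theorem 2(2). The name `match_join` and the encoding of $\lambda$ as a dictionary `lam` from pattern edges to view-edge identifiers are hypothetical.

```python
def match_join(pattern_edges, lam, view_matches):
    """Compute {pattern edge: match set} from view extensions only.

    lam:          {pattern edge: list of view-edge ids}   (the mapping lambda)
    view_matches: {view-edge id: set of graph edges}      (taken from V(G))
    Returns {} if some match set becomes empty.
    """
    # Merge step: S_e is the union of S_e' over all e' in lambda(e).
    S = {e: set().union(*(view_matches[ve] for ve in lam[e]))
         for e in pattern_edges}

    # Pattern edges grouped by their source node.
    out_edges = {}
    for (u, u2) in pattern_edges:
        out_edges.setdefault(u, []).append((u, u2))

    def supported(node, v):
        # v can match pattern node `node` only if every pattern edge
        # leaving `node` still has a match that starts at v.
        return all(any(src == v for src, _ in S[e])
                   for e in out_edges.get(node, []))

    # Refinement step: drop invalid matches until a fixpoint is reached.
    changed = True
    while changed:
        changed = False
        for e in pattern_edges:
            u, u2 = e
            valid = {(v, w) for (v, w) in S[e]
                     if supported(u, v) and supported(u2, w)}
            if valid != S[e]:
                S[e] = valid
                changed = True
            if not S[e]:
                return {}
    return S
```

Note that the data graph $G$ never appears as an argument: only the view extensions and $\lambda$ are consulted, which is the point of the view-based approach.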
Example 4: Consider $G$, $Q_s$ and $\mathcal{V}$ shown in Fig. 3. One can verify $Q_s \sqsubseteq \mathcal{V}$ by a mapping $\lambda$ that maps (AI, Bio), (PM, AI) to $e_1$, $e_2$ in $V_1$, respectively; and (DB, AI), (AI, SE), (SE, DB) to $e_3$, $e_4$, $e_5$ in $V_2$, respectively. MatchJoin then merges view matches guided by $\lambda$, and removes $(\text{AI}_1, \text{SE}_1)$ from $S_{(\text{AI},\text{SE})}$, which is an invalid match for (AI, SE) in $Q_s$. This further leads to the removal of $(\text{SE}_1, \text{DB}_2)$ from $S_{(\text{SE},\text{DB})}$, and $(\text{DB}_2, \text{AI}_2)$ from $S_{(\text{DB},\text{AI})}$.
Input: A pattern query Qs, a set of view definitions V and their extensions V(G), a mapping λ.
Output: The query result M as Qs(G).
1. for each edge e in Qs do
2.     Se := ∅;
3.     for each e' ∈ λ(e) do
4.         Se := Se ∪ Se';
5. while there is change in Sep for an edge ep = (u, u') in Qs do
6.     for each e = (u'', u) in Qs and e' = (v'', v) ∈ Se do
7.         if there is e1 = (u'', u1) in Qs but no (v'', v1) in Se1 then
8.             Se := Se \ {e'};
9.         if there is e2 = (u, u2) in Qs but no (v, v2) in Se2 then
10.            Se := Se \ {e'};
11.        if Se = ∅ then return ∅;
12. return M = {(e, Se) | e ∈ Qs}, which is Qs(G);
Fig. 2: Algorithm MatchJoin
To complete the proof of Theorem 2, we show that MatchJoin is correct and analyze its complexity.
Correctness. For each edge $e$ in $Q_s$, we denote by $S^*_e$ the match set of $e$ in $G$, and by $S_e$ the set maintained for $e$ as MatchJoin processes $Q_s$ and $\mathcal{V}(G)$. For the correctness of MatchJoin, it suffices to show that it preserves the following two invariants: (1) at any time, for each edge $e$ of $Q_s$, $S^*_e \subseteq S_e$; and (2) $S_e = S^*_e$ when MatchJoin terminates. For if these hold, then MatchJoin never misses any match or introduces any invalid match when it terminates.
Proof of Invariant (1). By $Q_s \sqsubseteq \mathcal{V}$, there exists a mapping $\lambda$ such that $S^*_e \subseteq \bigcup_{e' \in \lambda(e)} S_{e'}$. Algorithm MatchJoin takes as input $\lambda$, $Q_s$, $\mathcal{V}$ and $\mathcal{V}(G)$ (Fig. 2). (1) For each edge $e$ in $Q_s$, it initializes $S_e$ by merging $S_{e'}$ for all $e' \in \lambda(e)$. Hence $S^*_e \subseteq S_e$ due to $Q_s \sqsubseteq \mathcal{V}$. (2) During the while loop (lines 5-10, Fig. 2), MatchJoin repeatedly refines $S_e$ by removing matches that are no longer valid according to the definition of graph simulation. More specifically, for an edge $e_p = (u, u')$ in $Q_s$ with $S_{e_p}$ changed, a match $e' = (v'', v) \in S_e$ of an edge $e = (u'', u)$ in $Q_s$ becomes invalid if (a) there is an edge $e_1 = (u'', u_1)$ in $Q_s$ but there exists no match $(v'', v_1) \in S_{e_1}$ (lines 7-8); or (b) there is an edge $e_2 = (u, u_2)$ in $Q_s$ but there exists no match $(v, v_2) \in S_{e_2}$ (lines 9-10). Note that (i) both cases indicate that at least one match becomes invalid, and (ii) there exist no other cases that make a match invalid, by the definition of graph simulation. Hence MatchJoin never removes a true match, and never misses an invalid match by checking the two conditions. Thus, $S^*_e \subseteq S_e$ during the loop (lines 5-10).
Proof of Invariant (2). When algorithm MatchJoin terminates, either (1) some $S_e$ becomes empty (line 11), or (2) no invalid match can be found. Since $S^*_e \subseteq S_e$ during the entire loop (Invariant (1)), in case (1) there exists some edge $e$ such that $S^*_e$ is empty. That is, $G$ does not match $Q_s$, and MatchJoin returns $\emptyset$ correctly. Otherwise, i.e., in case (2), all invalid matches are removed (lines 7-10), and $S^*_e = S_e$ when MatchJoin terminates.
From the analysis above, the correctness of MatchJoin follows. That is, the If condition is verified.
Putting these together, we have shown Theorem 2(1).
Fig. 3: Answering pattern queries using views
Complexity. To complete the proof of Theorem 2(2), we provide a detailed worst-case time complexity analysis for algorithm MatchJoin as follows.
Observe that when a match is removed from $\mathcal{V}(G)$, it will never be put back, i.e., $\mathcal{V}(G)$ is monotonically decreasing. Thus each match in $\mathcal{V}(G)$ is processed at most once. Note that if an edge $e$ in $G$ appears in different match sets $S_e$, each occurrence is considered as a distinct edge match. In addition, the index $I$ can be initialized in $O(|Q_s||\mathcal{V}(G)|)$ time. As a result, the while loop (line 5) and for loop (line 6) together are bounded by [...]
These complete the proof of Theorem 2.

Remark. It takes O(|Q
In practice V(G) is much smaller than G. Indeed, for WebGraph in our experiments (Section 7), only 2 to 7 views are needed to answer Q s , and the overall size of V(G) is no more than 11% of the size of the entire WebGraph.
Optimization. MatchJoin may visit each $S_e$ multiple times. To reduce unnecessary visits, below we introduce an optimization strategy for MatchJoin. The strategy evaluates $Q_s$ by using ranks in $Q_s$ as follows. Given a pattern $Q_s$, the strongly connected component graph $G_{SCC}$ of $Q_s$ is obtained by collapsing each strongly connected component SCC of $Q_s$ into a single node $s(u)$. The rank $r(u)$ of each node $u$ in $Q_s$ is computed as follows: (a) $r(u) = 0$ if $s(u)$ is a leaf in $G_{SCC}$, i.e., $s(u)$ has no outgoing edge; and (b) $r(u) = \max\{1 + r(u') \mid (s(u), s(u')) \in E_{SCC}\}$ otherwise. Here $E_{SCC}$ is the edge set of the $G_{SCC}$ of $Q_s$. The rank $r(e)$ of an edge $e = (u', u)$ in $Q_s$ is set to be $r(u)$.
Bottom-up strategy. We revise MatchJoin by processing edges $e$ in $Q_s$ following an ascending order of their ranks (lines 5-11). One may verify that this "bottom-up" strategy guarantees the following for the number of visits.
Lemma 3: For all edges $e = (u', u)$ where $u'$ and $u$ do not reach any non-singleton SCC in $Q_s$, MatchJoin visits its match set $S_e$ at most once using the bottom-up strategy. □
Indeed, assume that algorithm MatchJoin visits the match set of an edge $e = (u', u)$ at least twice. Then either MatchJoin does not follow a bottom-up strategy in the rank order, or at least one of $u'$ and $u$ reaches a non-singleton SCC in $Q_s$. In particular, when $Q_s$ is a DAG pattern (i.e., acyclic), MatchJoin visits each match set at most once, and the total number of visits is bounded by the number of edges in $Q_s$. As will be verified in Section 7, MatchJoin with the optimization strategy runs 1.66 times faster on WebGraph than its counterpart without optimization over cyclic patterns.
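A rank computation of this kind can be sketched in Python as follows. The sketch assumes the usual convention that a node whose SCC has no outgoing edge in $G_{SCC}$ gets rank 0 and every other node gets one plus the maximum rank among the SCCs it points to; this convention, the function name `node_ranks` and the use of a recursive Tarjan traversal (adequate for small patterns) are assumptions of the example.

```python
def node_ranks(nodes, edges):
    """Return {pattern node: rank}, computed on the SCC graph of the pattern."""
    succ = {u: [] for u in nodes}
    for u, w in edges:
        succ[u].append(w)

    # Tarjan's algorithm: comp[u] is a representative node of u's SCC.
    index, low, comp = {}, {}, {}
    stack, on_stack = [], set()
    counter = [0]

    def visit(u):
        index[u] = low[u] = counter[0]
        counter[0] += 1
        stack.append(u)
        on_stack.add(u)
        for w in succ[u]:
            if w not in index:
                visit(w)
                low[u] = min(low[u], low[w])
            elif w in on_stack:
                low[u] = min(low[u], index[w])
        if low[u] == index[u]:
            # u is the root of an SCC: pop its members off the stack.
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp[w] = u
                if w == u:
                    break

    for u in nodes:
        if u not in index:
            visit(u)

    # Edges of the SCC graph between distinct components.
    scc_succ = {}
    for u, w in edges:
        if comp[u] != comp[w]:
            scc_succ.setdefault(comp[u], set()).add(comp[w])

    rank = {}

    def r(c):
        # Assumed convention: leaves get 0, others 1 + max rank of successors.
        if c not in rank:
            rank[c] = max((1 + r(d) for d in scc_succ.get(c, ())), default=0)
        return rank[c]

    return {u: r(comp[u]) for u in nodes}
```

With the ranks in hand, a bottom-up order of pattern edges can be obtained by sorting each edge $(u', u)$ on the rank of its target $u$, for instance `sorted(edges, key=lambda e: ranks[e[1]])`.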

PATTERN CONTAINMENT PROBLEMS
In the next two sections, we study how to determine whether $Q_s \sqsubseteq \mathcal{V}$. Our main conclusion is that there are efficient algorithms for these, with their costs as a function of $|Q_s|$ and $|\mathcal{V}|$, which are typically small in practice, and are independent of data graphs and materialized views.
We start with three problems in connection with pattern containment, and establish their complexity. In the next section, we will develop effective algorithms for checking $Q_s \sqsubseteq \mathcal{V}$, and computing mapping $\lambda$ from $Q_s$ to $\mathcal{V}$.

Pattern containment problem.
The pattern containment problem is to determine, given a pattern query Q s and a set V of view definitions, whether Q s ⊑ V. The need for studying this problem is evident: Theorem 2 tells us that Q s can be answered by using views of V if and only if Q s ⊑ V.
The result below tells us that Q s ⊑ V can be efficiently decided (see Table 1 for |Q s |, |V|, card(V)). We will prove the result in Section 5, by providing a checking algorithm.
Theorem 4: Given a pattern query Q s and a set V of view definitions, it is in O(card(V)|Q s |^2 + |V|^2 + |Q s ||V|) time to decide whether Q s ⊑ V and if so, to compute an associated mapping λ from Q s to V. □

A special case of pattern containment is the classical query containment problem [6]. Given two pattern queries Q s1 and Q s2 , the latter is to decide whether Q s1 ⊑ Q s2 , i.e., whether for all graphs G, Q s1 (G) is contained in Q s2 (G). Indeed, when V contains only a single view definition Q s2 , pattern containment becomes query containment. From this and Theorem 4 the result below immediately follows.

Corollary 5: The query containment problem for graph pattern queries is in quadratic time. □
As for relational queries (see, e.g., [6]), query containment is important in minimizing and optimizing pattern queries. Corollary 5 shows that the analysis for graph patterns takes quadratic time, as opposed to the intractability of its counterpart for relational conjunctive queries.
Minimal containment problem. As shown in Section 3, the complexity of pattern matching using views is dominated by |V(G)|. This suggests that we reduce the number of views used for answering Q s . Indeed, the fewer views are used, the smaller |V(G)| is. This gives rise to the minimal containment problem. Given Q s and V, it is to find a minimal subset V' of V that contains Q s . That is, (1) Q s ⊑ V', and (2) for any proper subset V'' of V', Q s ⋢ V''.
The good news is that the minimal containment problem does not make our lives harder.We will prove the next result in Section 5 by developing a cubic-time algorithm.
Minimum containment problem. One may also want to find a minimum subset V' of V that contains Q s . The minimum containment problem, denoted by MMCP, is to find a subset V' of V such that (1) Q s ⊑ V', and (2) for any subset V'' of V, if Q s ⊑ V'', then card(V') ≤ card(V''). As will be seen shortly (Examples 6 and 7) and verified by our experimental study, MMCP analysis often finds smaller V' than the views found by algorithm minimal.
MMCP is, however, nontrivial: its decision problem is NP-complete and MMCP is APX-hard. Here APX is the class of problems that allow PTIME algorithms with approximation ratio bounded by a constant (see [49] for APX). Nonetheless, we show that MMCP is approximable within O(log |E p |) in low polynomial time, where |E p | is the number of edges of Q s . That is, there exists an efficient algorithm that identifies a subset V' of V with performance guarantees whenever Q s ⊑ V.

Theorem 7: The minimum containment problem is (1) NP-complete (its decision problem) and APX-hard, but (2) it is approximable within O(log |E p |) in low polynomial time. □
Proof. We first show Theorem 7(1). We defer the proof of Theorem 7(2) to Section 5, where an approximation algorithm is provided as a constructive proof.
(I) We first show that MMCP is NP-complete. The decision problem of MMCP is to decide, given an integer k, whether there exists a subset V' of V such that Q s ⊑ V' and card(V') ≤ k. It is in NP since there exists an algorithm that guesses V' and checks it in PTIME (Theorem 4). We next show the NP-hardness by reduction from the NP-complete set cover problem (SCP) (cf. [42]).
Given a set X, a collection U of its subsets and an integer B, SCP is to decide whether there exists a B-element subset U' of U that covers X, i.e., ⋃_{U∈U'} U = X. Given such an instance of SCP, we construct an instance of MMCP as follows: (a) for each x i ∈ X, we create a unique edge e xi with two distinct nodes u xi and v xi ; (b) we define a pattern query Q s as a graph consisting of all edges e xi defined in (a); (c) for each subset U j ∈ U, we define a corresponding view definition V j that consists of the edges e xi for all x i ∈ U j ; and (d) we set k = B.
The construction is obviously in PTIME. We next verify that there exists a cover U' of size no more than B if and only if there exists V' of size no more than k that contains Q s .
(1) Assume that there exists a subset U' of U that covers X with size no more than B. Let V' be the set of view definitions V j corresponding to U j ∈ U'. One can verify that Q s ⊑ V', since there exists a mapping λ that maps E p of Q s to the powerset P(⋃_{V j ∈V'} E j ), such that for any data graph G, S e ⊆ ⋃_{e'∈λ(e)} S e' for all edges e ∈ E p . Moreover, card(V') = |U'| ≤ B = k.
(2) Conversely, if there exists V' ⊆ V that contains Q s with no more than k view definitions, it is easy to see that the corresponding set U' is a set cover with at most B elements.
As SCP is known to be NP-complete, so is MMCP.
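To make the reduction concrete, the following sketch builds the MMCP instance from a set cover instance and checks on a toy input that the two optima coincide. It assumes that the endpoints created for different elements carry distinct labels, so that the view match of V j in Q s is exactly the edge set of V j ; under that assumption containment reduces to covering E p . The names scp_to_mmcp and smallest_cover are hypothetical, and the brute-force search is for illustration only.

```python
from itertools import combinations

def scp_to_mmcp(universe, subsets, budget):
    """Map a set cover instance (X, U, B) to (pattern edges, views, k)."""
    # (a) one fresh edge per element, with two distinct endpoints
    edge_of = {x: (f"u_{x}", f"v_{x}") for x in universe}
    # (b) the pattern consists of all these edges
    pattern_edges = set(edge_of.values())
    # (c) one view per subset U_j, made of the edges of its elements
    views = [frozenset(edge_of[x] for x in u_j) for u_j in subsets]
    # (d) the bound k equals B
    return pattern_edges, views, budget

def smallest_cover(target, family):
    """Size of a smallest sub-family whose union equals target (brute force)."""
    target = set(target)
    for r in range(len(family) + 1):
        for combo in combinations(family, r):
            if set().union(*combo) == target:
                return r
    return None
```

On this encoding, a family of at most B subsets covers X exactly when the corresponding at most k views cover all pattern edges, which mirrors steps (1) and (2) of the argument above.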
(II) A problem is APX-hard if every APX problem can be reduced to it by PTIME approximation preserving reductions (AFP-reduction [49]). An AFP-reduction from a (minimization) problem Π 1 to another Π 2 is characterized by a function pair (f, g), where (a) for any instance I 1 of Π 1 , I 2 = f(I 1 ) is an instance of Π 2 such that opt 2 (I 2 ) ≤ opt 1 (I 1 ), where function opt 1 () (resp. opt 2 ()) measures the quality of an optimal solution to I 1 (resp. I 2 ), and (b) for any solution s 2 to I 2 , s 1 = g(I 1 , s 2 ) is a solution to I 1 such that obj 1 (I 1 , s 1 ) ≤ obj 2 (I 2 , s 2 ), where function obj 1 () (resp. obj 2 ()) measures the quality of a solution to I 1 (resp. I 2 ). The APX-hardness of MMCP is verified by AFP-reduction from the minimum set cover (also denoted as SCP), the optimization version of SCP, which is known to be APX-hard (cf. [49]).
(1) We first define a function f. Given an instance I 1 of the SCP as its input, f outputs an instance I 2 of the MMCP following the same transformation in (I). Here opt 2 (I 2 ) ≤ opt 1 (I 1 ), where opt 1 () (resp. opt 2 ()) denotes the size of the minimum set cover (resp. the minimum view definition set) that covers X (resp. Q s ). It is easy to see that function f is in PTIME.
(2) We then construct function g. Given a feasible solution V' for the instance I 2 , g outputs a corresponding U' following the construction given in (1) above. Here obj 1 () (resp. obj 2 ()) measures the cardinality of the solution U' to I 1 (resp. V' to I 2 ). Note that g is trivially in PTIME.
We now show that (f, g) is an AFP-reduction from the SCP to MMCP. It suffices to show that (a) opt 2 (I 2 ) ≤ opt 1 (I 1 ), and that (b) obj 1 (I 1 , s 1 ) ≤ obj 2 (I 2 , s 2 ). Indeed, the construction guarantees a one-to-one mapping from the elements in a set cover for I 1 to the view definitions in a view definition set for I 2 . Thus, opt 2 (I 2 ) = opt 1 (I 1 ), and obj 1 (I 1 , s 1 ) = obj 2 (I 2 , s 2 ). Hence, (f, g) is indeed an AFP-reduction. It is known that SCP is APX-hard (cf. [49]); hence MMCP is also APX-hard. □

DETERMINING PATTERN CONTAINMENT
In this section, we prove Theorems 4, 6 and 7(2) by providing effective (approximation) algorithms for checking pattern containment, minimal containment and minimum containment in Sections 5.1, 5.2 and 5.3, respectively.

Pattern Containment
We start with a proof of Theorem 4, i.e., that whether Q s ⊑ V can be decided efficiently. To do this, we first propose a sufficient and necessary condition to characterize pattern containment. We then develop a cubic-time algorithm based on the characterization.
Sufficient and necessary condition. To characterize pattern containment, we introduce a notion of view matches.
Consider a pattern query Q s and a set V of view definitions. For each V ∈ V, we match V against Q s via simulation, treating Q s as a data graph. If V has a match in Q s , then S e V is the nonempty match set of e V for each edge e V of V (see Section 2.1). We define the view match from V to Q s , denoted by M Qs V , to be the union of S e V for all e V in V.
The result below shows that view matches yield a characterization of pattern containment.
Proposition 8: For view definitions V and pattern Q s , Q s ⊑ V if and only if E p = ⋃_{V∈V} M Qs V . □

Proof. (I) We first prove the If condition. Assume that E p = ⋃_{V∈V} M Qs V , i.e., the union of all the view matches from V "covers" E p . We show that Q s ⊑ V by constructing a mapping λ from E p to the edges in V, such that for all data graphs G and all edges e in Q s , S e ⊆ ⋃_{e'∈λ(e)} S e' .
We construct a mapping λ as a "reversed" view matching relation: for each edge e p of Q s , λ(e p ) is a set of edges e' from the view definitions in V, such that for each edge e' of a view definition V ∈ V, if e' ∈ λ(e p ), then e p is a match of e' in the view match from V to Q s . Consider any data graph G. (i) If Q s has no match in G, then the containment condition holds trivially by definition; (ii) otherwise, for each pattern edge e p of Q s , there exists at least one edge e in G that is a match of e p via simulation. Moreover, for any edge e' (of view V) in λ(e p ), e p is in turn a match of e' via simulation. One can verify that any match e of e p in G is also a match of e' ∈ λ(e p ) in V(G). To see this, note that (a) e is a match of e p ; as a result, for any edge e p1 adjacent to e p , there exists an edge e 1 adjacent to e such that e 1 is a match of e p1 , by the semantics of simulation (Section 2); and (b) e p is a match of e'; hence similar to the argument for (a), for any edge e a adjacent to e' in a view definition V, there exists an edge e p1 adjacent to e p such that e p1 is a match of e a , by the semantics of graph pattern matching via graph simulation. From (a) and (b) it follows that e is a match of e' in the view extension. Hence, given any match e of e p from Q s in G, there exists an edge e' in λ(e p ) from a view definition V, such that e is also a match of e' in view extension V(G). That is, λ guarantees that Q s ⊑ V, by definition.
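The characterization suggests a direct check, sketched below under simplifying assumptions: node conditions are plain label equality, a graph is a pair (dictionary from node id to label, list of directed edges), and simulation only inspects outgoing edges. The names simulate, view_match and contain are used for illustration; this is not the paper's algorithm contain verbatim and makes no attempt to meet its complexity bound.

```python
def simulate(p_nodes, p_edges, g_nodes, g_edges):
    """Maximum simulation of a pattern in a graph, or None if some node has no match."""
    g_succ = {}
    for a, b in g_edges:
        g_succ.setdefault(a, set()).add(b)
    # Start from label-compatible candidates and refine to a fixpoint.
    sim = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]}
           for u in p_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if g_succ.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not s for s in sim.values()):
        return None
    return sim

def view_match(view, query):
    """For each view edge, the query edges matching it (query treated as data graph)."""
    v_nodes, v_edges = view
    q_nodes, q_edges = query
    sim = simulate(v_nodes, v_edges, q_nodes, q_edges)
    if sim is None:
        return {}
    return {e: {(a, b) for a, b in q_edges if a in sim[e[0]] and b in sim[e[1]]}
            for e in v_edges}

def contain(query, views):
    """Return lambda (query edge -> set of (view index, view edge)), or None."""
    _, q_edges = query
    lam = {e: set() for e in q_edges}
    for i, view in enumerate(views):
        for view_edge, matched in view_match(view, query).items():
            for query_edge in matched:
                lam[query_edge].add((i, view_edge))   # "reversed" view match
    # Containment holds iff every pattern edge is covered by some view match.
    return lam if all(lam.values()) else None
```

The returned dictionary plays the role of the mapping λ: each pattern edge is sent to the view edges whose match sets cover it.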
Input: A pattern query Qs, and a set of view definitions V.
Output: A subset V' of V that minimally contains Qs.
7. if E = Ep then break;
8. if E ≠ Ep then return ∅;
9. for each M Qs Vj ∈ S do
10. if there is no e ∈ M Qs Vj such that M(e) \ {Vj} = ∅ then
11. V' := V' \ {Vj}; update M;
12. return V';
Fig. 5: Algorithm minimal

We answer pattern queries using views as follows. Given a pattern Q s and a set V of views, we first determine whether Q s ⊑ V by using algorithm contain; if so, for all graphs G, we compute Q s (G) by using algorithm MatchJoin. If Q s ⋢ V, we compute approximate answers to Q s , as will be discussed in Section 6. All these steps run in time determined by |Q s |, |V| and |V(G)|, not by the size |G|.

Minimal Containment Problem
We now prove Theorem 6 by presenting an algorithm that, given Q s and V, finds a minimal subset V' of V that contains Q s whenever Q s ⊑ V.

Algorithm. The algorithm, denoted as minimal, is shown in Fig. 5. Given a pattern query Q s and a set V of view definitions, it returns either a nonempty subset V' of V that minimally contains Q s , or ∅ to indicate that Q s ⋢ V.
Algorithm minimal initializes (1) an empty set V' for selected views, (2) an empty set S for view matches of V', and (3) an empty set E for edges in view matches. It also maintains an index M that maps each edge e in Q s to a set of views (line 1). Similar to algorithm contain, minimal first computes M Qs Vi for all V i ∈ V (lines 2-7). In contrast to contain, which simply merges the view matches, it extends S with a new view match M Qs Vi only if M Qs Vi contains a new edge not in E, and updates M accordingly (lines 4-7). The for loop stops as soon as E = E p (line 7). The algorithm then eliminates redundant views (lines 9-11), by checking whether the removal of V j causes M(e) = ∅ for some e ∈ M Qs Vj (line 10). If no such e exists, it removes V j from V' (line 11). After all view matches are checked, minimal returns V' (line 12).
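The two phases described above (selection, then elimination of redundant views) can be sketched as follows. The sketch assumes that the view matches have already been computed and are passed in as (view name, set of covered pattern edges) pairs; the function name and this input encoding are hypothetical, and an empty list stands in for ∅.

```python
def minimal(pattern_edges, view_matches):
    """Return names of a minimal set of views covering pattern_edges, or []."""
    pattern_edges = set(pattern_edges)
    chosen = []                                   # V'
    covered = set()                               # E
    index = {e: set() for e in pattern_edges}     # M: edge -> views covering it

    # Phase 1: keep a view only if its match adds an uncovered edge.
    for name, match in view_matches:
        if match - covered:
            chosen.append(name)
            covered |= match
            for e in match:
                index[e].add(name)
        if covered == pattern_edges:
            break
    if covered != pattern_edges:
        return []                                 # Q_s is not contained in V

    # Phase 2: drop a view when every edge it covers has another selected cover.
    matches = dict(view_matches)
    for name in list(chosen):
        if all(index[e] - {name} for e in matches[name]):
            chosen.remove(name)
            for e in matches[name]:
                index[e].discard(name)
    return chosen
```

The result depends on the order in which views are visited, which is consistent with the problem asking for a minimal rather than a minimum subset.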
Proof of Theorem 6.To complete the proof of Theorem 6, we next provide a detailed correctness and complexity analysis of algorithm minimal (Fig. 5).
Correctness. Given a pattern Q s and a set V of view definitions, minimal either returns an empty set indicating Q s ⋢ V, or a subset V' of V. We show the correctness of minimal by proving that (1) minimal always terminates, (2) it only removes "redundant" view definitions, and (3) when it terminates, no redundant view definition is in V'.
(1) Algorithm minimal repeats the for loop (lines 2-7, Fig. 5) at most card(V) times, and in each iteration it computes view matches and adds a view definition V i to a result set V'. It then performs the redundancy check (lines 9-11) to remove all redundant view definitions, if any exist. As V' is a finite set, and its size is monotonically decreasing, the algorithm always terminates.
Fig. 6: Containment for pattern queries
(2) We show that minimal only removes "redundant" view definitions. (a) Each time it computes the view match for a view definition V i (line 3), it adds V i to V' only if the corresponding match set of V i can cover edges in Q s that have not been covered yet (line 4). Hence when the for loop terminates, one can verify that either the union of the view matches from V' covers E p (line 7), which indicates that V' contains Q s , or Q s ⋢ V (line 8), following Proposition 8. (b) A view definition V j is removed from V' only when there already exist other view definitions in V' "covering" every pattern edge e ∈ M Qs Vj (lines 10-11). Thus, minimal only removes redundant view definitions.
(3) When algorithm minimal terminates with Q s ⊑ V, for any view definition V in V', there exists at least one edge e that can only be introduced by V to cover E p . By Proposition 8, this indicates that Q s ⋢ V' \ {V} for any V ∈ V'. Thus minimal returns a minimal set that contains Q s .
Complexity. Similar to the complexity analysis of contain given above, algorithm minimal takes in total O(card(V)|Q s |^2 + |V|^2 + |Q s ||V|) time to compute all the view matches (line 3, Fig. 5). For each view match, the construction time for the index structure M (line 6) takes in total O(card(V)|Q s |) time (the outer loop is conducted at most card(V) times). The process for eliminating redundant view definitions (lines 9-11) The analysis above completes the proof of Theorem 6. □

Example 6: Consider Q s and V given in Fig. 6. After M Qs Vi (i ∈ [1,4]) are computed, algorithm minimal finds that E already equals E p , and breaks the loop, where M is initialized to be {((A, B) :

Minimum Containment Problem
We next prove Theorem 7(2), i.e., MMCP is approximable within O(log |E p |). We give such an algorithm following the greedy strategy of the approximation of [49] for the set cover problem. The algorithm of [49] achieves an approximation ratio O(log n), for an n-element set.
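As a rough illustration of the greedy strategy borrowed from set cover, the sketch below repeatedly picks the view whose match covers the most still-uncovered pattern edges. It is the textbook greedy heuristic applied to precomputed view matches, not necessarily the exact selection rule of algorithm minimum; the function name and the dictionary encoding of view matches are assumptions.

```python
def greedy_minimum(pattern_edges, view_matches):
    """view_matches: {view name: set of covered pattern edges}. Returns names or []."""
    uncovered = set(pattern_edges)
    remaining = dict(view_matches)
    chosen = []
    while uncovered:
        # View contributing the largest number of new edges.
        name = max(remaining,
                   key=lambda n: len(remaining[n] & uncovered),
                   default=None)
        if name is None or not (remaining[name] & uncovered):
            return []          # some edge cannot be covered: not contained
        chosen.append(name)
        uncovered -= remaining.pop(name)
    return chosen
```

With n = |E p | elements to cover, the standard analysis of this heuristic gives the O(log n) ratio cited above.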
Algorithm. The algorithm is denoted as minimum and shown in Fig. 7. Given a pattern Q s and a set V of views, minimum identifies a subset V' of V such that (1) most of the true matches in Q s (G). The larger Acc that can be induced by Q' s , the better.
for all G, Acc takes the maximum value 1.0. Observe that for any edge e in Q s , if e is covered by Q' s , then for any G, the match set S e of e in Q s (G) is a subset of the match set S' e of e in Q' s (G); that is, Q' s (G) finds all candidate matches of e in G.
Given a graph G, Q' s (G) finds matches of these edges, which make a superset of the corresponding edge matches in Q s (G). Using Q' s (G), one may further verify whether the matches of Q' s (G) make true matches in Q s (G) by inspecting their neighboring nodes and edges. One may also treat Q' s as a "relaxation" of Q s by dropping the condition imposed by edge (B, E), and take Q' s (G) as approximate answers to Q s in G. □

Computing maximally contained rewriting. It is known that finding maximally contained rewriting is intractable for SPC queries [45]. In contrast, maximally contained rewriting can be efficiently found for graph pattern queries. We next prove Theorem 9 by providing an algorithm that computes a maximally contained rewriting for Q s using V.
Algorithm. Given a pattern query Q s and a set V of view definitions, the algorithm, denoted as maximal (not shown), finds a maximally contained rewriting of Q s using V as follows. Similar to algorithm contain, maximal maintains a set E of all nonempty view matches, initially empty. For each view definition V ∈ V, it iteratively computes view match M Qs V and merges it with E, until every view is visited. The difference from contain is that instead of checking whether E covers all edges in Q s , maximal simply generates an induced subgraph of Q s with edge set E, and returns it as the maximally contained rewriting Q' s .
Example 9: Given Q s of Fig. 6 and V from Example 8, maximal finds a maximally contained rewriting Q' s of Q s by computing the union of view matches E from each view in V to Q s . One may verify that as E includes a set of edges

Proof of Theorem 9. Below we prove Theorem 9 by giving a detailed correctness and complexity analysis of maximal.
Correctness. It suffices to show that when algorithm maximal terminates, Q' s is (1) a contained rewriting, and (2) a maximal contained rewriting. Obviously maximal always terminates since it visits each view in a finite set V once.
(1) When algorithm maximal terminates, Q' s consists of only the edges of view matches from each view to Q s . Following Proposition 8, Q' s , as a graph pattern query, is contained by V, i.e., Q' s ⊑ V. Moreover, maximal preserves the invariant that at any time, Q' s contains only the edges from Q s . Hence Q' s ⊆ Q s . This shows that Q' s is a contained rewriting of Q s . Note that this also holds when Q' s is empty. (2) We next show that Q' s is maximal, i.e., there is no contained rewriting Q'' s of Q s using V such that Q' s ⊂ Q'' s . Assume that such a contained rewriting Q'' s exists. Then there must exist a view definition V such that M Qs V is in the edge set of Q'' s but it is not in Q' s . This cannot happen since algorithm maximal visits each view in V including V, and hence Q' s includes M Qs V when algorithm maximal visits V. This completes the proof of Theorem 9.
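The description above amounts to taking the union of all view matches and keeping the subgraph of Q s spanned by those edges. A minimal sketch follows, assuming the view matches are already available as sets of pattern edges; the function name and the return shape (node set, edge set) are illustrative choices rather than the paper's interface.

```python
def maximal(pattern_edges, view_matches):
    """Return (nodes, edges) of the rewriting induced by the union of view matches.

    view_matches: {view name: set of pattern edges matched by that view}.
    """
    pattern_edges = set(pattern_edges)
    covered = set()                        # E: union of nonempty view matches
    for match in view_matches.values():
        covered |= match & pattern_edges   # the rewriting only uses edges of Q_s
    nodes = {u for edge in covered for u in edge}
    return nodes, covered
```

If the returned edge set equals the full edge set of Q s , the rewriting coincides with Q s and the query is contained in the views; an empty result means no view contributes any edge.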

Complexity
Remark. Note that a mapping from the edges of Q' s to views can be readily induced by maximal, to be used as λ in MatchJoin for answering queries using views.

EXPERIMENTAL EVALUATION
Using real-life data, we conducted four sets of experiments to evaluate (1) the efficiency and scalability of algorithm MatchJoin for graph pattern matching using views; (2) the effectiveness of optimization techniques for MatchJoin; (3) the efficiency and effectiveness of (minimal, minimum) containment checking; and (4) the efficiency, accuracy and scalability of our query-driven approximation scheme, using maximally contained rewriting.
Experimental setting.
(1) Real-life graphs. We used four real-life graphs: (a) Amazon [1], a product co-purchasing network with 548K nodes and 1.78M edges. Each node has attributes such as title, group and sales-rank, and an edge from product x to y indicates that people who buy x also buy y. (b) Citation [2]

We generated a set of 12 view definitions for each real-life dataset. (a) For Amazon, we generated 12 frequent patterns following [34], where each view extension contains on average 5K nodes and edges. The views take 14.4% of the space of the

(3) Implementation. We implemented the following algorithms, all in Java: (1) contain, minimum and minimal for checking pattern containment; (2) maximal for finding the maximally contained rewriting; (3) Match, MatchJoin min and MatchJoin mnl for computing matches of patterns in a graph, where Match is the matching algorithm without using views [18], [28]; and MatchJoin min (resp. MatchJoin mnl ) revises MatchJoin by using a minimum (resp. minimal) set of views; (4) an algorithm MatchJoin max for approximately answering pattern queries, which invokes MatchJoin to evaluate maximally contained rewriting using views (Section 6); and (5) a version of MatchJoin min without using the ranking optimization (Section 3), denoted by MatchJoin nopt .
All the experiments were run on a machine with a 2.0GHz Intel Xeon E5-2650 (8-core) CPU and 32GB memory, running Windows Server 2008 (64-bit). Each experiment was run 5 times and the average is reported here.
Experimental results.We next present our findings.
Exp-1: Query answering using views. We first evaluated the performance of algorithms MatchJoin min and MatchJoin mnl , compared to Match [18], [28]. Using real-life data, we studied the efficiency (resp. scalability) of MatchJoin min , MatchJoin

As shown in Figures 9(f

Exp-3: Query containment. We evaluated the efficiency of pattern containment checking w.r.t. query complexity.

Efficiency. We generated a set of patterns with size ranging from (4, 8) to (9, 18), and node labels from the alphabet Σ of WebGraph. Using the same set of views V as in Fig. 9(d), we evaluated the efficiency of contain, minimal and minimum. As shown in Fig. 9(h), (1) all three algorithms are efficient, e.g., it only takes contain 0.1s to decide whether a pattern with size (9, 18) is contained in V; (2) they all take more time over larger patterns, as expected; and (3) contain accounts for about 68.6% (resp. 59.4%) of the time of minimal (resp. minimum) on average.

Algorithm minimum vs minimal. To measure the effectiveness of minimum and minimal, we define and investigate two ratios: R 1 = |T min |/|T mnl | as the ratio of the time used by minimum to that of minimal; and R 2 = |Minimum|/|Minimal| for the ratio of the size of subsets of views found by minimum to that of minimal. Using the same view definitions and patterns as in Fig. 9(h), we varied the size of patterns from (6, 6) to (9, 18). As shown in Fig. 9(i), (1) minimum is efficient on all patterns, e.g., it takes about 0.15s over patterns with size (9, 18); (2) minimum is effective: while minimum takes up to 122% of the time of minimal (R 1 ), it finds substantially smaller sets of views, only about 60%-66% of the size of those found by minimal, as indicated by R 2 . Both algorithms take more time over larger patterns, as expected.
Exp-4: Approximate answers.We evaluated the efficiency, scalability and accuracy of MatchJoin max , by using maximally contained rewriting (Section 6) and real-life graphs.
Efficiency. Using the same sets of views as in Figures 9(c) and 9(d), we generated two sets of patterns, where none of them is contained in the corresponding view set. Varying |Q s |, we evaluated the efficiency of MatchJoin max and found the following.
(1) maximal is efficient. For example, it takes less than 50ms to find a maximally contained rewriting for a pattern with 8 nodes and 16 edges (not shown). (2) As shown in Figures 9(j

2) The accuracy of MatchJoin max is not sensitive to the pattern size; instead, it is determined by how much a maximally contained rewriting "covers" the pattern query. For example, we found (not shown) that the accuracy of MatchJoin max is on average 0.63 when the rewriting "missed" two edges in the pattern query, and it increases to 0.82 when only one query edge is missed.
Scalability.We evaluated the scalability of MatchJoin max and Match, in the same setting as in Fig. 9(e).As shown in Fig. 9(l), (1) MatchJoin max scales better with |Q s | than Match; and (2) MatchJoin max takes only 4.4% of the time of Match when the scale factor is 0.1, and the saving is more significant for larger |G|.
Summary.From the experimental results we find the following.
(1) Answering pattern queries using views is effective in querying large graphs.For example, by using views, pattern matching via graph simulation is 23.2 times faster than computing matches directly on WebGraph.(2) Our view-based matching algorithms scale well with the query and data size.Moreover, they are much less sensitive to the size of data graphs.(3) It is efficient to determine whether a pattern query can be answered using views.In particular, our approximation algorithm for minimum containment effectively reduces redundant views.(4) Our optimization strategy further makes the view-based matching up to 1.66 times faster.
(5) When patterns are not contained by views, our query-driven approximation scheme evaluates the queries efficiently with reasonable accuracy.For example, MatchJoin max is 24.8 times faster than Match, with accuracy 0.73 over large Web graph.

CONCLUSION
We have studied graph simulation using views, from theory to algorithms. We have proposed a notion of pattern containment to characterize what pattern queries can be answered using views, and provided an efficient matching algorithm for such queries. We have also identified three fundamental problems for pattern containment, established their complexity, and developed effective (approximation) algorithms. When a pattern query is not contained in available views, we have developed efficient algorithms for computing maximally contained rewriting using views to get approximate answers. Our experimental results have verified the efficiency and effectiveness of our techniques. These results extend the study of query answering using views from relational and XML queries to graph pattern queries.
Our techniques can be readily extended to variants of graph simulation. Taking strong simulation [36] as an example, MatchJoin only needs to check (lines 7-11), for each pattern edge (u', u) and its match (v', v) in S, whether for each pattern edge (u'', u'), there is a match (v'', v'), with time complexity unchanged.
The study of graph pattern matching using views is still in its infancy. One issue is to decide what views to cache such that a set of frequently used pattern queries can be answered by using the views. Techniques such as adaptive and incremental query expansion [48] may apply. Another issue concerns view-based pattern matching via subgraph isomorphism. The third topic is to find a subset V' of V such that V'(G) is minimum for all graphs G. Finally, to find a practical method to query "big" social data, one needs to combine techniques such as view-based, distributed, incremental, and compression methods.

Fig. 9: Performance evaluation

Amazon dataset. (b) For Citation, we designed 12 views to search for papers and authors in computer science. The view extensions account for 12% of the Citation graph. (c) We generated 12 views, shown in Fig. 8, to find videos on Youtube, where each node is associated with a Boolean condition, specified by e.g., age (A), length (L), category (C), rate (R) and visits (V). Each view extension has about 700 nodes and edges, accounting for 4% of Youtube. (d) On WebGraph, we designed 12 views to search Web pages, where the view extensions account for 11% of WebGraph.

mnl and Match, by varying |Q s | (resp. |G|).

Efficiency. Figures 9(a), 9(b), 9(c) and 9(d) show the results on Amazon, Citation, YouTube and WebGraph, respectively, where the x-axis represents pattern size (|V p |, |E p |). The results tell us the following. (1) MatchJoin min and MatchJoin mnl substantially outperform Match: they are on average 8.1 and 5.2 times faster than Match over all real-life graphs, respectively. (2) While all the algorithms spend more time on larger patterns, MatchJoin min and MatchJoin mnl are less sensitive to pattern size than Match, as they reuse previous computation cached in the views. (3) The larger the graphs are, the more substantial the improvement of MatchJoin min and MatchJoin mnl is over Match. For example, MatchJoin min (resp. MatchJoin mnl ) is 23.2 (resp. 13.3) times faster than Match on WebGraph, and 2.7 (resp. 2.3) times faster on the smaller Amazon.

Scalability. Using WebGraph, we evaluated the scalability of MatchJoin min , MatchJoin mnl and Match. Fixing |Q s | = (4, 6), we varied |G| by using scale factors from 0.1 (i.e., 0.1 times the original graph size) to 1.0. The results are reported in Fig. 9(e), from which we can see the following. (1) MatchJoin min scales best with |G|, and is 1.73 times faster than MatchJoin mnl . This verifies that evaluating pattern queries by using fewer views significantly reduces computation time. The results are consistent with the observation of Figures 9(a), 9(b), 9(c) and 9(d).

Exp-2: Optimization techniques. Varying the size of DAG (resp. cyclic) patterns, we evaluated the effectiveness of the optimization strategy given in Section 3, by comparing the performance of MatchJoin min and MatchJoin nopt on Citation (resp. WebGraph).

Fig. 10: Performance evaluation

since more invalid matches can be removed by the strategy. This explains why MatchJoin min works better on WebGraph than on Citation, since WebGraph is denser than Citation.

) and 10(a), MatchJoin max substantially outperforms Match in running time: it is on average 24.8 (resp. 4.2) times faster than Match on WebGraph (resp. Youtube). (3) The running time of MatchJoin max is much less sensitive to |Q s | compared to Match.

Accuracy. We report the accuracy (F-measure, Section 6) of MatchJoin max in Figures 9(k) and 10(b) on WebGraph and Youtube, respectively. We found the following. (1) MatchJoin max finds approximate answers with high accuracy. The Acc is 0.73 (resp. 0.65) on WebGraph (resp. Youtube) on average. (

Acknowledgments. Fan is supported in part by NSFC 61133002, 973 Program 2012CB316200 and 2014CB340302, Shenzhen Peacock Program 1105100030834361, Guangdong Innovative Research Team Program 2011D005, EPSRC EP/J015377/1 and EP/M025268/1, and a Google Faculty Research Award. Wang is supported in part by NSFC 61402383 and 71490722, Sichuan

from a view V to Qs
|Qs| (resp. |V|): size (total number of nodes and edges) of Qs (resp. view definition V)

: Algorithm MatchJoin yields Q s (G) shown in the table below, as the final result.
MatchJoin spends O(|Q s |) time to initialize an empty set M (lines 1-2). It next merges matches in V(G) via the mapping λ (lines 3-4). Note that the size of λ(e) is bounded by Σ_{V∈V} |V|. The merge process hence takes in total O(|Q s ||V(G)|) time.
MatchJoin then iteratively removes invalid matches by conducting a fixpoint computation (lines 5-11). Given a match (v', v) in V(G), MatchJoin verifies its validity, i.e., whether it carries over to Q s (G) in the current iteration, in O(|V(G)|) time; this is because at most Σ_{e1=(u',u1)∈Ep} |S e1 | + Σ_{e2=(u,u2)∈Ep} |S e2 | matches have to be inspected, which is bounded by O(|V(G)|). To speed up the validity checking, MatchJoin employs an index structure I as a hash table, which keeps track of a set of key-value pairs. Each key is a pair of nodes (u, v), where u is in Q s and v can match u. Each value corresponding to the key (u, v) is a set of pattern edges and their match sets (e = (u, u 2 ), S e ). The index dynamically maintains the key-value pairs: (1) for each node v, if there exists an edge e emitting from u with S e = ∅, then I(u, v) is set as ∅, and (2) given a match (v, v 2 ) of e = (u, u 2 ), if I(u, v) or I(u 2 , v 2 ) is already empty, no further checking is needed, and (v, v 2 ) can be removed from S e . Following this, it takes MatchJoin constant time (rather than linear time) to check the validity of a match (lines 7, 9).
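The merge step and the fixpoint computation described above can be sketched as follows. This is a naive version that rescans match sets instead of using the index I, so it does not achieve the constant-time validity check; the function name, the argument encoding, and the use of an empty dictionary for an empty answer are assumptions made for illustration.

```python
def match_join(pattern_edges, view_ext, lam):
    """Compute {pattern edge: set of matching data edges}, or {} if no match.

    pattern_edges: list of (u_src, u_dst)
    view_ext:      {view edge id: set of (v_src, v_dst)}, the materialized views
    lam:           {pattern edge: iterable of view edge ids}
    """
    # Merge step: S_e is the union of the match sets of the view edges in lam(e).
    S = {e: set().union(*(view_ext[ve] for ve in lam[e])) for e in pattern_edges}

    out_edges = {}
    for e in pattern_edges:
        out_edges.setdefault(e[0], []).append(e)

    def supported(u, v):
        # v can match u only if every pattern edge leaving u has a match leaving v.
        return all(any(a == v for a, _ in S[e]) for e in out_edges.get(u, []))

    # Fixpoint: repeatedly drop matches whose endpoints lost their support.
    changed = True
    while changed:
        changed = False
        for e in pattern_edges:
            u1, u2 = e
            bad = {(v1, v2) for v1, v2 in S[e]
                   if not (supported(u1, v1) and supported(u2, v2))}
            if bad:
                S[e] -= bad
                changed = True
            if not S[e]:
                return {}
    return S
```

Processing pattern_edges in ascending rank order is a natural refinement, since it removes unsupported matches of low-rank edges before the edges that depend on them are inspected.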

Theorem 9: Given a pattern Q s and a set V of view definitions, it is in O(card(V)|Q s |^2 + |V|^2 + |Q s ||V|) time to find a maximally contained rewriting of Q s using V. □

Algorithm maximal is the same as algorithm contain except the last step for constructing Q' s . It takes O(card(V)|Q s |^2 + |V|^2 + |Q s ||V|) time to compute all the view matches and merge those nonempty matches. The construction of Q' s with edge set E takes at most O(|Q s |) time. Hence, it is in total O(card(V)|Q s |^2 + |V|^2 + |Q s ||V|) time to compute Q' s , having the same complexity as contain.
[3] DAG (directed acyclic graph) with 1.4M nodes and 3M edges, in which nodes represent papers with attributes such as title, authors, year and venue, and edges denote citations. (c) YouTube [5], a recommendation network with 1.6M nodes and 4.5M edges. Each node is a video with attributes such as category, age and rate, and each edge from x to y indicates that y is in the related list of x. (d) WebGraph [3], a web graph including 118.1M nodes and 1.02B edges, where each node represents a web page with id and domain.

(2) Pattern and view generator. We implemented a generator for graph pattern queries, controlled by three parameters: the number |V p | of pattern nodes, the number |E p | of pattern edges, and label f v from an alphabet Σ of labels taken from corresponding real-life graphs. We use (|V p |, |E p |) to denote the size of a pattern query.

and 9(g), (1) MatchJoin min is more efficient than MatchJoin nopt for all the patterns. For example, MatchJoin min is 1.46 (resp. 1.66) times faster than MatchJoin nopt for DAG (resp. cyclic) patterns on average. (2) The improvement becomes more substantial when |Q s | gets larger. This is because for larger patterns, the bottom-up strategy used in MatchJoin min can eliminate redundant matches more quickly. (3) The optimization strategy works even better on denser big graphs,