An Efficient Approach for Discovering Graph Entity Dependencies (GEDs)

Graph entity dependencies (GEDs) are novel graph constraints, unifying keys and functional dependencies, for property graphs. They have been found useful in many real-world data quality and data management tasks, including fact checking on social media networks and entity resolution. In this paper, we study the discovery problem of GEDs: finding a minimal cover of valid GEDs in a given graph. We formalise the problem, and propose an effective and efficient approach to overcome major bottlenecks in GED discovery. In particular, we leverage existing graph partitioning algorithms to enable fast GED-scope discovery, and employ effective pruning strategies over the prohibitively large space of candidate dependencies. Furthermore, we define an interestingness measure for GEDs based on the minimum description length principle, to score and rank the mined cover set of GEDs. Finally, we demonstrate the scalability and effectiveness of our GED discovery approach through extensive experiments on real-world benchmark graph data sets, and present the usefulness of the discovered rules in different downstream data quality management applications.


Introduction
In recent years, integrity constraints (e.g., keys [1] and functional dependencies (FDs) [2]) have been proposed for property graphs to enable the specification of various data semantics, and tackling of graph data quality and management problems. Graph entity dependencies (GEDs) [4,5] are new fundamental constraints unifying keys and FDs for property graphs.
A GED ϕ over a graph G is a pair, ϕ = (Q[ū], X → Y), specifying the dependency X → Y over homomorphic matches of the graph pattern Q[ū] in G. Intuitively, since graphs are schemaless, the graph pattern Q[ū] identifies the set of entities in G on which the dependency X → Y should hold.
In general, GEDs can express and encode the semantics of relational FDs and conditional FDs (CFDs), and subsume other graph integrity constraints (e.g., GKeys [1] and GFDs [2]).
GEDs have numerous real-world data quality and data management applications (cf. [4,5] for more details). For example, they have been extended and used for: fact checking in social media networks [20], entity resolution in graphs and relations [3,6,22], consistency checking [21], amongst others.
In this work, motivated by the expressivity and usefulness of GEDs in many downstream applications, we study the problem of automatically learning GEDs from a given property graph. To the best of our knowledge, no discovery algorithm currently exists in the literature for GEDs, although discovery solutions have been proposed for GKeys and GFDs in [12] and [16] respectively. Unfortunately, neither solution can be used directly to find GEDs without changing the pattern matching semantics (i.e., isomorphic vs. homomorphic matches); and even then, [16,12] can only discover subsets of the valid GEDs in the given graph. Thus, the need for effective and efficient techniques for mining rules that capture the full GED semantics still exists.
The general goal in data dependency mining is to find an irreducible set (aka minimal cover set) of valid dependencies (devoid of all implications and redundancies) that hold over the given input data. This is a challenging and often intractable problem for most dependencies, and GEDs are not an exception. In fact, the problem is inherently more challenging for GEDs than it is for other graph constraints (e.g., GFDs and GKeys) and traditional relational dependencies. The reason is three-fold: a) the presence of graph patterns as topological constraints (unseen in relational dependencies); b) the attribute sets of a GED admit three possible types of literals (whereas GFDs have 2, GKeys have 1, and CFDs have 2); c) the implication and validation analyses of GEDs, which are key tasks in the discovery process, are shown to be NP-complete and co-NP-complete respectively (see [5] for details).
This paper proposes an efficient and effective GED mining solution. We observe that two major efficiency bottlenecks in mining GEDs are pattern discovery in large graphs and traversal of the prohibitively large search space of candidate dependencies. Thus, we leverage existing graph partitioning algorithms to surmount the first challenge, and employ several effective pruning rules and strategies to resolve the second.
We summarise our main contributions as follows. 1) We formalise the GED discovery problem (in Section 3). We extend the notions of trivial, reduced, and minimal dependencies for GEDs; and introduce a formal cover set finding problem for GEDs. 2) We develop a new two-phase approach for efficiently finding GEDs in large graphs (Section 4). The developed solution leverages existing graph partitioning algorithms to mine graph patterns quickly, and find their matches. Further, we design a comprehensive attribute (and entity) dependency discovery architecture with effective candidate generation, search space pruning and validation strategies. Moreover, we introduce a new interestingness score for GEDs, for ranking the discovered rules.
3) We perform extensive experiments on real-world graphs to show that GED discovery can be feasible in large graphs, and demonstrate the effectiveness, scalability and efficiency of our proposed approach (Section 5).

Preliminaries
This section presents preliminary concepts and definitions. We use A, L, C to denote the universal sets of attributes, labels and constants respectively.
The definitions of graph, graph pattern, and matches follow those in [4,5].
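Graph. We consider a directed property graph G = (V, E, L, F_A), where: (1) V is a finite set of nodes; (2) E is a finite set of edges; (3) each node v ∈ V has a special attribute id denoting its identity, and a label L(v) drawn from L; (4) every node v has an associated list F_A(v) of attribute-value pairs.
Graph Pattern. A graph pattern, denoted by Q[ū], is a directed graph Q[ū] = (V_Q, E_Q, L_Q), where: (a) V_Q and E_Q represent the sets of pattern nodes and pattern edges respectively; (b) L_Q is a label function that assigns a label to each node v ∈ V_Q and each edge e ∈ E_Q; and (c) ū is the list of all the nodes, called (pattern) variables, in V_Q. All labels are drawn from L, including the wildcard "*" as a special label. Two labels l, l′ ∈ L are said to match, denoted l ≍ l′, iff: (a) l = l′; or (b) either l or l′ is "*".
Matches. A match of a graph pattern Q[ū] in a graph G is a homomorphism h from Q to G that maps every pattern node and pattern edge, with label l, to a node or edge of G with a label l′ such that l ≍ l′. We denote the list of all matches of Q[ū] in G by H(ū). An example of a graph, graph patterns and their matches is presented in Example 1.
Graph Entity Dependencies. A graph entity dependency (GED) [4] ϕ is a pair (Q[ū], X → Y), where Q[ū] is a graph pattern, and X and Y are (possibly empty) sets of literals over ū. A literal is a constant literal (x.A = c), a variable literal (x.A = y.A′), or an id literal (x.id = y.id). A match h(ū) satisfies a literal w if: (a) when w is x.A = c, the attribute h(x).A exists and h(x).A = c; (b) when w is x.A = y.A′, the attributes h(x).A and h(y).A′ exist and have the same value; and (c) when w is x.id = y.id, then h(x).id = h(y).id.
A match h(ū) satisfies a set X of literals if h(ū) satisfies every literal w ∈ X. A graph G satisfies ϕ, denoted G |= ϕ, if every match of Q[ū] in G that satisfies X also satisfies Y. We illustrate the semantics of GEDs via the sample graph and graph patterns in Figure 1.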
Example 2 (GEDs). We define exemplar GEDs over the sample graph in Figure 1(a), using the graph patterns in Figure 1(b)-(e).
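The satisfaction semantics above can be made concrete with a small sketch. This is an illustration only, not the paper's implementation: the class names (`ConstLit`, `VarLit`, `IdLit`), the dictionary encoding of nodes and matches, and the helper names are all assumptions introduced here, and the requirement that an attribute must exist for a literal to hold follows the usual GED semantics of [4,5].

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Union

# Assumed encoding: a node is a dict of attributes that always has an "id";
# a match h maps each pattern variable to one node of the graph.
Node = Dict[str, object]
Match = Dict[str, Node]

@dataclass(frozen=True)
class ConstLit:   # constant literal x.A = c
    x: str
    attr: str
    const: object

@dataclass(frozen=True)
class VarLit:     # variable literal x.A = y.B
    x: str
    a: str
    y: str
    b: str

@dataclass(frozen=True)
class IdLit:      # id literal x.id = y.id
    x: str
    y: str

Literal = Union[ConstLit, VarLit, IdLit]

def holds(h: Match, lit: Literal) -> bool:
    """Check whether the match h satisfies a single literal."""
    if isinstance(lit, ConstLit):
        return lit.attr in h[lit.x] and h[lit.x][lit.attr] == lit.const
    if isinstance(lit, VarLit):
        return (lit.a in h[lit.x] and lit.b in h[lit.y]
                and h[lit.x][lit.a] == h[lit.y][lit.b])
    return h[lit.x]["id"] == h[lit.y]["id"]

def satisfies(matches: Iterable[Match], X: List[Literal], Y: List[Literal]) -> bool:
    """G |= (Q, X -> Y): every match that satisfies X also satisfies Y."""
    return all(
        all(holds(h, w) for w in Y)
        for h in matches
        if all(holds(h, w) for w in X)
    )
```

Here `matches` stands for the list H(ū) of homomorphic matches of the pattern, which is assumed to have been computed beforehand.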

Problem Formulation
Given a property graph G, the general GED discovery problem is to find a set Σ of GEDs such that G |= Σ. However, like other dependency discovery problems, such a set can be large and littered with redundancies (cf. [16,22,44]). Thus, in line with [16], we are interested in finding a cover set of non-trivial and non-redundant dependencies over persistent graph patterns.
In the following, we formally introduce the GEDs of interest, and present our problem definition.
Persistent Graph Patterns. Graph patterns can be seen as "loose schemas" in GEDs [5]. It is therefore crucial to find persistent graph patterns in the input graph data for GED discovery. However, counting the frequency of sub-graphs is a non-trivial task, and can be intractable depending on the adopted count-metric (e.g., harmful overlap [30], and maximum independent sets [29]).
We adopt the minimum image based support (MNI) [45] metric to count the occurrence of a graph pattern in a given graph, due to its anti-monotone and tractability properties.
Let I = {i_1, …, i_m} be the set of isomorphisms of a pattern Q[ū] to a graph G. Let M(v) = {i_1(v), …, i_m(v)} be the set containing the (distinct) nodes in G to which the functions i_1, …, i_m map a node v ∈ V_Q. The MNI of Q in G, denoted by mni(Q, G), is defined as:

mni(Q, G) = min{ |M(v)| : v ∈ V_Q }.    (3.1)

We say a graph pattern Q[ū] is persistent (i.e., frequent) in a graph G, if mni(Q, G) ≥ τ, where τ is a user-specified minimum MNI threshold.
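As a minimal sketch of the MNI count, assume each isomorphism is given as a dictionary from pattern nodes to graph nodes (an encoding chosen here for illustration; the function name `mni` and its parameters are likewise assumptions). The support is the size of the smallest image set M(v) over the pattern nodes.

```python
from typing import Dict, Hashable, Iterable, List

def mni(pattern_nodes: List[Hashable],
        isomorphisms: Iterable[Dict[Hashable, Hashable]]) -> int:
    """Minimum image based support: min over pattern nodes v of |M(v)|."""
    # M(v): the distinct graph nodes that some isomorphism assigns to v.
    images = {v: set() for v in pattern_nodes}
    for iso in isomorphisms:
        for v in pattern_nodes:
            images[v].add(iso[v])
    return min((len(nodes) for nodes in images.values()), default=0)

def is_persistent(pattern_nodes, isomorphisms, tau: int) -> bool:
    """A pattern is persistent (frequent) if its MNI reaches the threshold tau."""
    return mni(pattern_nodes, isomorphisms) >= tau
```

For a pattern x → y with isomorphisms {x: 1, y: 2}, {x: 1, y: 3} and {x: 1, y: 4}, the image sets are M(x) = {1} and M(y) = {2, 3, 4}, so the MNI is 1 even though there are three isomorphisms.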
Representative Graph Patterns. For many real-world graphs, it is prohibitively inefficient to mine frequent graph patterns with commodity computational resources largely due to their sizes (cf. [31,32]). To alleviate this performance bottleneck, in our pipeline, we consider graph patterns that are frequent within densely-connected communities within the input graph.
We term such graph patterns representative, and adopt the Constant Potts Model (CPM) [35] as the quality function for communities in a graph.
Formally, the CPM of a community c is given by:

H(c) = e_c − γ · n_c(n_c − 1)/2,    (3.2)

where e_c and n_c are the numbers of edges and nodes in c respectively; and γ > 0 is a (user-specified) resolution parameter: a higher γ leads to more communities, and vice versa. Intuitively, the resolution parameter γ constrains the intra- and inter-community densities to be no less than and no more than the value of γ respectively.
Thus, we say a graph pattern Q[ū] is representative or (γ, τ )-frequent if it is τ -frequent within at least one γ-dense community c ∈ G.
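The role of γ can be illustrated numerically. The sketch below assumes the standard per-community CPM term, e_c − γ · n_c(n_c − 1)/2, and measures density over unordered node pairs; both function names are hypothetical. Under this form, a community contributes a non-negative quality exactly when its density is at least γ.

```python
def cpm_quality(num_edges: int, num_nodes: int, gamma: float) -> float:
    """Per-community CPM term: e_c - gamma * C(n_c, 2) (assumed standard form)."""
    return num_edges - gamma * num_nodes * (num_nodes - 1) / 2

def density(num_edges: int, num_nodes: int) -> float:
    """Fraction of unordered node pairs in the community joined by an edge."""
    pairs = num_nodes * (num_nodes - 1) / 2
    return num_edges / pairs if pairs else 0.0

def is_gamma_dense(num_edges: int, num_nodes: int, gamma: float) -> bool:
    """A community is gamma-dense when its density is no less than gamma."""
    return density(num_edges, num_nodes) >= gamma
```

For example, a triangle (3 edges, 3 nodes) has density 1.0 and quality 1.5 at γ = 0.5, whereas a 4-node community with 2 edges has density 1/3 and quality −1.0 at the same γ.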
Trivial GEDs. We say a GED ϕ : (Q[ū], X → Y) is trivial if: (a) the set of literals in X cannot be satisfied (i.e., X evaluates to false); or (b) Y is derivable from X (i.e., ∀ w ∈ Y, w can be derived from X by transitivity of the equality operator). We consider non-trivial GEDs.
Reduced GEDs. Given two patterns Q[ū] and Q′[ū′], we say that Q reduces Q′, denoted Q ≪ Q′, when Q is a less restrictive topological constraint than Q′.
Thus, we say a GED ϕ is reduced in a graph G if: (a) G |= ϕ; and (b) for any ϕ′ such that ϕ′ ≪ ϕ, G ⊭ ϕ′. We say a GED ϕ is minimal if it is both non-trivial and reduced.

Cover of GEDs. Given a set Σ of GEDs such that
We study the following GED discovery problem.
Definition 1 (Discovery of GEDs). Given a property graph G, a user-specified resolution parameter γ and an MNI threshold τ: find a cover set Σ_c of all valid minimal GEDs that hold over (γ, τ)-frequent graph patterns in G.

The Proposed Discovery Approach
This section introduces a new and efficient GED discovery approach. The approach consists of two main components for: (a) finding representative and reduced graph patterns (i.e., "scopes") and their (homomorphic) matches in the graph; and (b) finding minimal attribute (entity) dependencies that hold over the graph patterns.
A snippet of the pseudo-code for the proposed GED discovery process is presented in Algorithm 1, requiring two main procedures for the above-mentioned tasks (cf. lines 2 and 3-5 of Algorithm 1 respectively).

Graph Pattern Discovery & Matching
The first task in the discovery process is finding representative graph patterns, and their matches in the given graph. We present a three-phase process for the task, viz.: (i) detection of dense communities within the input graph; (ii) mining frequent graph patterns in communities; and (iii) finding matches of the discovered (and reduced) graph patterns in the input graph.

Dense communities detection
The first step in our approach is straightforward: division of the input graph G into multiple dense communities, C = {c 1 , · · · , c k }. In particular, we employ the quality function in Equation 3.2 to ensure that the density of each community c i ∈ C, (i ∈ [1, k]) is no less than γ. In general, any efficient graph clustering algorithm can be employed for this task; and we adopt the Leiden graph clustering algorithm [37] with the CPM quality function to partition the input graph into γ-dense communities (cf. line 2 in Procedure 1).

Representative Graph Patterns Mining
Next, we take the discovered γ-dense communities, C = {c_1, …, c_k}, as input, and return a set of frequent and reduced graph patterns, using isomorphic matching and the MNI metric (in Equation 3.1) to count pattern occurrences in the communities (cf. lines 3-6 of Procedure 1).
We adapt the GraMi graph pattern mining algorithm [28] to find the set, Q, of all τ-frequent patterns across c_1, …, c_k. As expected, Q may contain redundant patterns. Therefore, we designed a simple and effective algorithm, Function 1, to prune all redundancies in Q, and return a set Q′ of representative and reduced patterns in G (i.e., there does not exist any pair of patterns in Q′ of which one makes the other redundant). The input to Function 1 is a set Q of τ-frequent graph patterns in γ-dense communities (i.e., (γ, τ)-frequent or representative graph patterns); and the output is a set Q′ of graph patterns without any redundancies.
Procedure 1, lines 10-11: S ← S ∪ (Q, H(Q, G)); return S.
Function 1, line 7: if count(VF2U(Q(i), Q(j))) > 0 then ...
Function 1 maintains a list T of Boolean tags: each tag is associated with a graph pattern Q_i ∈ Q, and denotes whether or not the associated graph pattern is redundant.
The variables n, Q and T are initialised in line 1 of Function 1. The graph patterns in Q are then sorted from largest to smallest by number of edges (in line 2). Thereafter, we evaluate pairs of graph patterns for redundancy (in lines 3-9). In particular, the functions VF2U and count are used, respectively, to find the isomorphic matches between the patterns at indexes i and j of Q, and to compute the number of matches returned. We obtain VF2U by extending an efficient implementation of the VF2 algorithm [43], relaxing some of its criteria. The main difference is that, in VF2U, instead of matching induced subgraphs, the query graph pattern simply queries subgraphs in the target graph pattern which are isomorphic to it.
For the pairwise comparison of the graph patterns, the pattern at index j is considered as the query pattern while that at index i is the target pattern.
The algorithm iterates from the largest pattern in Q (line 3). In iteration i, the pattern indexed i is used as the target Q(i) if it is tagged false. Every pattern indexed from i + 1 to n − 1 is then used in turn as the query Q(j), to check whether it is isomorphic to a subgraph of Q(i). VF2U returns the isomorphisms from Q(j) to Q(i), and count returns the size of the resulting set.
During every iteration, the pattern indexed j (j > i) is tagged as true if count(VF2U(Q(i), Q(j))) > 0.
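The pruning loop described above can be sketched as follows. This is not Function 1 itself: the brute-force `embeds` check merely stands in for VF2U on small patterns, labels are compared by plain equality (wildcards are ignored), and the dictionary encoding of patterns and both function names are assumptions made for illustration.

```python
from itertools import permutations

def embeds(query, target) -> bool:
    """True if `query` maps injectively into `target`, preserving node labels
    and labelled edges (a non-induced subgraph match, standing in for VF2U)."""
    q_nodes = list(query["labels"])
    t_nodes = list(target["labels"])
    for image in permutations(t_nodes, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(query["labels"][v] == target["labels"][m[v]] for v in q_nodes)
        edges_ok = all((m[u], l, m[v]) in target["edges"] for (u, l, v) in query["edges"])
        if labels_ok and edges_ok:
            return True
    return False

def prune_redundant(patterns):
    """Sort patterns by edge count (largest first); tag a smaller pattern as
    redundant when it embeds into an untagged larger one; return the rest."""
    ordered = sorted(patterns, key=lambda p: len(p["edges"]), reverse=True)
    redundant = [False] * len(ordered)      # the tag list T
    for i, target in enumerate(ordered):
        if redundant[i]:
            continue                        # only untagged patterns act as targets
        for j in range(i + 1, len(ordered)):
            if not redundant[j] and embeds(ordered[j], target):
                redundant[j] = True
    return [p for p, tag in zip(ordered, redundant) if not tag]
```

A pattern is encoded here as `{"labels": {node: label}, "edges": {(source, label, target)}}`. The brute-force check is exponential in the pattern size, so it only suits the very small patterns used for illustration.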

Homomorphic Pattern Matching
Given the set, Q′, of reduced representative graph patterns from the previous step, we find their homomorphic matches in the input graph (cf. Procedure 1).
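To make the difference between homomorphic and isomorphic matching concrete, the following sketch enumerates homomorphic matches by brute force. It is an illustration under assumed encodings (patterns and graphs as dictionaries of node labels and labelled edge triples), not the matcher used in the pipeline, and it is only practical for toy inputs.

```python
from itertools import product

WILDCARD = "*"

def label_match(a, b) -> bool:
    """Two labels match if they are equal or either one is the wildcard."""
    return a == b or WILDCARD in (a, b)

def homomorphic_matches(pattern, graph):
    """Enumerate all label-preserving mappings from pattern nodes to graph
    nodes; unlike isomorphisms, two variables may map to the same node."""
    p_nodes = list(pattern["labels"])
    g_nodes = list(graph["labels"])
    found = []
    for image in product(g_nodes, repeat=len(p_nodes)):
        h = dict(zip(p_nodes, image))
        nodes_ok = all(label_match(pattern["labels"][v], graph["labels"][h[v]])
                       for v in p_nodes)
        edges_ok = all(
            any(s == h[u] and t == h[v] and label_match(l, gl)
                for (s, gl, t) in graph["edges"])
            for (u, l, v) in pattern["edges"])
        if nodes_ok and edges_ok:
            found.append(h)
    return found
```

For a pattern in which two persons x and y live in the same city z, a graph with two persons living in one city yields four homomorphic matches (including those with x = y), but only two of them are injective, which is what isomorphic matching would return.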

Attribute (Entity) Dependency Discovery
Here, we mine dependencies over the pseudo-relational tables produced from the matches of the discovered graph patterns in the previous step, as depicted in Figure 2. A pre-processing phase allows any relevant preparation of the pseudo-relational table of the matches of a graph pattern for efficient downstream mining tasks.

Partitioning
Next, we transform the input table into structures for efficient mining of rules. In the following, we extend the notions of equivalence classes and partitions. We remark that appropriate pre-processing (based on domain knowledge) from the previous step ensures only semantically meaningful pattern variable pairs are considered for variable literals. In the following example, we show the partitions and equivalence classes of a toy example.

Figure panels: (a) Single Attribute Literal (Itemset) Generation; (b) Equivalence Classes of Single Itemsets
Example 3. Suppose the pseudo-relational table in Table 1 captures the attributes of the homomorphic matches of pattern Q_3 in Figure 1. Figure 5 shows the resulting partitions and equivalence classes. We remark that, in this example, no id literals are produced due to unique node identities in Table 1.
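A minimal sketch of attribute partitioning over a pseudo-relational table is given below. The table is a toy one invented for illustration (it is not Table 1), columns are named `variable.attribute` by assumption, and the function names are hypothetical. Rows that agree on the chosen columns fall into the same equivalence class.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices by their values on `attrs`; each group is one
    equivalence class, and the set of groups is the partition pi(attrs)."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(row.get(a) for a in attrs)   # assumption: missing values group together
        groups[key].append(idx)
    return {frozenset(g) for g in groups.values()}

def constant_classes(rows, attr):
    """Map each constant c of a column to the rows where attr = c, i.e. the
    equivalence class behind a candidate constant literal."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        if attr in row:
            classes[row[attr]].add(idx)
    return dict(classes)
```

Each row of `rows` stands for one match of the pattern, flattened into a dictionary of its variables' attribute values.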

Candidate dependency generation
Here, we discuss the data structures and strategies for generating candidate dependencies. We extend and model the search space of the possible dependencies with the attribute lattice [39]. We adopt a level-wise top-down search approach for generating candidate rules, testing/validating them and using the valid rules to prune the search space.
In general, a node n in the lattice is a pair n = (X, π(X)). However, for non-redundant generation and pruning of candidate sets X, Y, we organise the literal sets of each node under their respective pattern variable(s). For example, Figure 3 shows a snapshot of the lattice for generating candidate dependencies (for the running example in Table 1 and Figure 5). It shows all nodes in level 1, and some of the nodes for levels 2 and 3. To generate the first level, L[1], of nodes in the lattice, we use the single attribute partition of variables in the previous step (i.e., the output of Function 2).
Further, to derive nodes for higher levels of the lattice, we follow the principles in Lemma 2 to ensure redundancy-free and valid literal set generation.
For every node pair n 1 , n 2 ∈ L[i] that results in a permissible node n ∈ L[i + 1], we establish the edges n 1 → n and n 2 → n representing candidate dependencies. All candidate dependencies are tested for validity in the next step. We illustrate the candidate dependency generation with our running example below.
Example 4 (Generating and searching the lattice space).
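The level-wise construction can be sketched as below. Since the permissibility conditions of Lemma 2 are not reproduced here, the sketch substitutes the classic apriori condition (a node at level i + 1 is generated only if all of its size-i subsets exist at level i); that substitution, the encoding of literal sets as frozensets of strings, and the function name are assumptions.

```python
from itertools import combinations

def next_level(level):
    """Given the literal sets of level i, return the nodes of level i + 1 and
    the edges (parent, child) that represent candidate dependencies."""
    level = set(level)
    if not level:
        return set(), set()
    size = len(next(iter(level)))
    nodes = set()
    for a, b in combinations(level, 2):
        union = a | b
        if len(union) != size + 1:
            continue                      # the pair must differ in exactly one literal
        parents = {frozenset(s) for s in combinations(union, size)}
        if parents <= level:              # apriori stand-in for Lemma 2
            nodes.add(union)
    edges = {(child - {lit}, child) for child in nodes for lit in child}
    return nodes, edges
```

Starting from three single-literal nodes, the call produces three nodes and six edges at level 2; feeding those nodes back in produces the single three-literal node at level 3 with three incoming edges.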

Validation of dependencies
We use a level-wise traversal of the search space to validate the generated candidate rules between two successive levels in the lattice. That is, any two successive levels L[i] and L[i + 1] correspond to the sets of LHS and RHS candidates respectively. For any edge X → Y between an LHS candidate X and an RHS candidate Y, we test the dependency X → Y \ X. Function 3 performs this task.
In Line 4, X_i ⊂ Y_j is true for all candidate dependencies (i.e., the edge X_i → Y_j exists). If the dependency holds (i.e., π(X_i) = π(Y_j) by Lemma 1), the rule is valid and is used to prune the search space.
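The partition-based validity test can be sketched as follows. Because X ⊂ Y, the partition π(Y) always refines π(X), so the two partitions coincide exactly when the extra literals in Y \ X never split an equivalence class of X. The toy tables, column names and function names below are assumptions for illustration, not the paper's Function 3.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Partition row indices by their values on the given columns."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row.get(a) for a in attrs)].append(idx)
    return {frozenset(g) for g in groups.values()}

def dependency_holds(rows, lhs, rhs_node) -> bool:
    """Test X -> Y \\ X for a lattice edge X -> Y (with X a subset of Y) by
    checking that pi(X) equals pi(Y)."""
    assert set(lhs) <= set(rhs_node), "lattice edges require X to be a subset of Y"
    return partition(rows, lhs) == partition(rows, rhs_node)
```

On a table where every `Ann` row has city `Rome` and every `Bob` row has city `Oslo`, the edge {x.name} → {x.name, y.city} validates x.name → y.city; adding a `Bob` row with city `Rome` splits a class and invalidates it.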

Minimal cover set discovery and ranking of GEDs
In this phase of the GED mining process, we perform further analysis of all mined rules in Σ from the previous step to determine and eliminate all redundant or implied rules. Furthermore, we introduce a measure of interestingness to score and rank the resulting cover set Σ_c of GEDs.

Minimal cover set
From the foregoing, every GED ϕ ∈ Σ is minimal (cf. Section 3). However, Σ may not be minimal as there may exist redundant GEDs due to the transitivity property of GEDs.
Thus, it suffices to eliminate all transitively implied GEDs in Σ to produce a minimal cover Σ_c of Σ. We use a simple but effective process in Function 4 to eliminate all transitively implied GEDs in Σ, and produce Σ_c. We create a graph Γ with all GEDs in Σ such that each node corresponds to a unique literal set of the dependencies (i.e., Lines 2, 3). We remark that, if there exists transitivity in Σ, triangles (aka 3-cliques) will be formed in Γ. The transitively implied edge of each triangle is removed, and a minimal set Σ_c is returned (Lines 6-8).
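The triangle-based elimination can be sketched as follows, under the assumption that all rules share the same graph pattern and are encoded as (LHS, RHS) pairs of hashable literal sets. This is a hypothetical stand-in for Function 4: a rule X → Z is dropped when some Y gives X → Y and Y → Z among the rules still kept, and rules are removed one at a time so that mutually implying rules are not all discarded.

```python
def eliminate_transitive(rules):
    """Remove every rule (X, Z) that closes a triangle X -> Y -> Z with two
    rules that are still kept; return the remaining rules."""
    remaining = set(rules)
    for x, z in sorted(rules, key=repr):          # fixed order within a run
        others = remaining - {(x, z)}
        successors_of_x = {b for (a, b) in others if a == x}
        if any((y, z) in others for y in successors_of_x):
            remaining.discard((x, z))             # implied by transitivity
    return remaining
```

For the rules X → Y, Y → Z and X → Z, the third is removed because it is implied by the first two. The sketch only inspects two-step paths, mirroring the triangle argument in the text.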

Ranking GEDs
We present a measure to score the interestingness of GEDs as well as a means to rank rules in a given minimal cover Σ_c. We adopt the minimum description length (MDL) principle [46] to encode the relevance of GEDs. In line with MDL, the lower the rank of a GED, the more interesting it is. We perform an empirical evaluation of the score in the following section.

Empirical Study
In this section, we present an evaluation of the proposed discovery approach using real-world graph data sets. In the following, we discuss the setting of the experiments, examine the scalability of the proposed algorithm, compare our solution to relevant related works in the literature, and demonstrate the usefulness of some of the mined GEDs in various downstream applications.

Experimental Setting
Here, we cover the data sets used and the setting of experiments.
Data Sets. We used three real-world benchmark data sets with different features and sizes, summarised in Table 2.
Exp-1a. In this experiment, we fix the number of attributes on every node to at most seven (7) in all three data sets; we use up to the full graph of the DBLP and IMDB data sets, and a sample of comparable size of YAGO4. In general, GED discovery in the DBLP data is the most efficient as it produces the fewest matches for its frequent patterns compared to the other data sets. This characteristic is captured by the density of the graphs (in Table 2). In other words, the denser (more connected) a graph is, the more matches its graph patterns are likely to have and, consequently, the longer the GED discovery takes.
The distribution of discovered GEDs for different sizes of the three data sets is presented in Figure 6 (b).
Exp-1b. Here, we examine the impact of graph pattern sizes on the discovery time. We use the number of edges in a graph pattern as its size, considering patterns of size 2 to 5. For each data set, we mine GEDs over patterns of size 2 to 5, using the full graph with up to seven (7) attributes per node.
The time performance results for different pattern sizes are presented in Figure 6 (c). In this experiment, the IMDB and YAGO4 data sets produced comparable time performances, except for the case of patterns with size 5. In the exceptional case, we found significantly more matches of size-5 patterns in the IMDB data, which is reflected in the plot.
Exp-1c. This set of experiments examines the impact of the number of attributes on the discovery time in all data sets, varying the number of attributes on each node in the graphs from 2 to 7. To enable a clearer characterisation, we conduct the experiment over different pattern sizes.
The plots in Figure 6 (d) -(f) show the discovery time characteristics for all three data sets with increasing number of attributes for patterns with size 2 through 4 respectively. As can be seen from all three diagrams, the number of attributes directly affects the time performance irrespective of the pattern size.

Comparison to mining other graph constraints
The notion of GEDs generalises and subsumes graph functional dependencies (GFDs) and graph keys (GKeys). Thus, we show in this group of experiments that our proposal can be used for mining GFDs and GKeys via two simple adaptations. First, we replace the graph pattern matching semantics from homomorphism to isomorphism in accordance with the definitions of GFDs and GKeys. Second, we consider only id-literals for GKeys mining, and no id-literals in GFDs mining.
In Figure 7, we present a summary of the comparative analysis of mining GEDs, GFDs and GKeys. Parts (a) to (c) of the plot represent the relative time performance of mining the three different graph constraints over the data sets, and show the performance for up to three graph patterns (using the top three most frequent patterns in each data set).
As expected, the differences in the time performances come down to: a) the matching efficiency of homomorphism vs. isomorphism; and b) the number of literal types considered for each dependency/constraint type. Surprisingly, the homomorphic matching efficiency outweighs the number of literal types considered. Thus, in almost all cases, GED mining is more efficient even though it considers all three literal types (compared to two for GFDs, and only one for GKeys), as seen in the plots.
Furthermore, we show in Figure 7 (d) the number of discovered rules for each constraint type (for the YAGO4 data). As depicted in the plot, more GEDs are mined in all categories, which can be explained by the subsumption relationship. This pattern is repeated in the other two data sets; thus, we only show one for brevity of presentation.

Usefulness of mined rules
In this section, we present some examples of the discovered GEDs with a discussion of their potential use in real-world data quality and data management applications. Further, we discuss the interestingness ranking of the mined GEDs.
Examples of discovered GEDs. We manually inspected the mined GEDs and validated their usefulness. We present 3 GEDs found in the YAGO4 data, with patterns shown in Figure 8. The GED σ_2 states that for any match of the parent-child pattern Q_6, the child x_1 and parent x_2 must share the same last name. The GED σ_3 claims that for any two sport events y_1, y_2 in the 2018 Wuhan Open event series x, y_1 and y_2 must share the same event prefixName. Indeed, these GEDs are straightforward and understandable; and can be used for violation or inconsistency detection in the graph through the validation of the dependencies.
For example, σ_2 can be used to check all parent-child relationships that do not have the same surname, while σ_3 can be employed to find any sub-event in the '2018 Wuhan Open' that violates the prefix-naming constraints.
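Violation detection with a rule such as σ_2 can be sketched as below. The literal is modelled as a predicate over a match, the attribute name `familyName` and the sample nodes are invented for illustration (they are not taken from YAGO4), and the function names are hypothetical.

```python
def same_attr(x, y, attr):
    """Build a variable literal x.attr = y.attr as a predicate over a match."""
    def literal(h):
        return attr in h[x] and attr in h[y] and h[x][attr] == h[y][attr]
    return literal

def find_violations(matches, lhs, rhs):
    """Return the matches that satisfy every literal in lhs but not all of rhs."""
    return [h for h in matches
            if all(lit(h) for lit in lhs) and not all(lit(h) for lit in rhs)]

# Hypothetical matches of a parent-child pattern: x1 is the child, x2 the parent.
matches = [
    {"x1": {"id": 1, "familyName": "Okafor"}, "x2": {"id": 2, "familyName": "Okafor"}},
    {"x1": {"id": 3, "familyName": "Silva"}, "x2": {"id": 4, "familyName": "Souza"}},
]
violating = find_violations(matches, [], [same_attr("x1", "x2", "familyName")])
```

With an empty LHS, every match is checked, and `violating` holds the single parent-child pair whose surnames differ; such pairs are candidates for inspection rather than confirmed errors.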
The rank of GEDs. In the following, we show a plot of the interestingness scores of the discovered GEDs, for three different α values (i.e., 0, 0.5 and 1, respectively). For brevity, we show the results for only the top K = 20 GEDs in Figure 9. A lower interestingness score is representative of a better GED, and vice versa. The plot shows the interestingness score is affected by the value of α, and ranking GEDs makes it easier for users to find meaningful GEDs. For α = 0, the rank score reflects the complexity of the GED, whereas α = 1 reflects the persistence of the rule in data; α = 0.5 combines both with equal weight.
Figure 9: Top ranked GEDs w.r.t. α settings

Related Work
In this section, we review related works in the literature on the discovery of graph data constraints. Mining constraints in graph data has received increasing attention in recent years. In the following, we present some relevant related works in two broad categories.
Rule discovery in non-property graphs.
Most research on profiling and mining of non-property graph data focuses on XML and RDF data (cf. [47]). For instance, [48,49,50] investigate the problem of mining (path) association rules in RDF data and knowledge bases (KBs), whereas [51,52,53,54] present inductive logic programming based rule miners over RDF data and KBs.
In contrast to the above-mentioned works, this paper studies the mining of functional (both conditional and entity) dependencies in property graphs.
However, our proposed method can be adapted for mining rules in RDF data, particularly, GEDs with constant literals.

Rule discovery in property graphs.
More related to this work are techniques for rule discovery in property graphs. Examples of some notable works in this area include: 1) [17,27,55,56,57], which investigated the discovery of association rules in property graphs; and 2) [12,16,22,58] on mining keys and dependencies in property graphs, which are closest to this work. In particular, [12] presents a frequent sub-graph expansion based approach for mining keys in property graphs, while [16] proposes efficient parallel graph functional dependency discovery for large property graphs. It is noteworthy that the discovered rules (GEDs) in this work subsume the semantics of rules returned in [12] (i.e., GKeys) and in [16] (i.e., GFDs).
Moreover, the work in [58] studies the discovery of temporal GFDs (GFDs that hold over property graphs over periods of time); and [22] studies the discovery of graph differential dependencies (GDDs), i.e., GEDs with distance semantics, over property graphs. Although related, this work differs from mining temporal GFDs as we consider only one time period of graphs.
Further, we do not consider the semantics of difference in data as captured in GDDs. Essentially, GEDs can be considered a special case of GDDs, where the distance is zero over all attributes.

Conclusion
In this paper, we presented a new approach for mining GEDs. The developed discovery pipeline seamlessly combines graph partitioning, non-redundant and frequent graph pattern mining, homomorphic graph pattern matching, and attribute/entity dependency mining to discover GEDs in property graphs. We developed effective pruning strategies and techniques to ensure the returned set of GEDs is minimal, without redundancies. Furthermore, we proposed an effective MDL-based measure to score the interestingness of GEDs. Finally, we performed experiments on large real-world graphs, to demonstrate the feasibility, effectiveness and scalability of our proposal.
Indeed, the empirical results show our method is effective, scalable and efficient; and finds semantically meaningful rules.