1 Introduction

Real-life data and the relationships among them can be conveniently modelled as graphs. Biological and chemical datasets, for instance, are more naturally represented as graphs. This creates a need to mine interesting patterns from graph datasets, and frequent subgraph mining is one of the key research areas [1, 4].

Existing methods for mining frequent subgraph patterns use either an Apriori approach [2, 3] or a Pattern-Growth approach [5]. The Apriori approach generates redundant candidates, which is very expensive. AGM [2] and FSG [3], the most common Apriori-based algorithms, both suffer from this overhead. AGM [2] uses an iterative vertex-based candidate generation method that increases the substructure size by one vertex at each iteration. Two size-k frequent graphs are merged only if they share the same size-(k-1) subgraph. The newly formed candidate includes the common size-(k-1) subgraph and the two additional vertices from the two size-k patterns. FSG [3], on the other hand, adopts a similar edge-based candidate generation strategy that increases the substructure size by one edge at each iteration. The popular Pattern-Growth algorithm gSpan [5] adopts depth-first search (DFS), as opposed to the breadth-first search (BFS) inherent in Apriori-like algorithms such as AGM and FSG. Each graph is assigned a unique Minimum DFS Code, and a hierarchical search tree is constructed from these codes. Through a pre-order DFS traversal of this tree, gSpan discovers all frequent subgraphs satisfying the minimum support threshold.

For real-time decision-making problems such as automated traffic control, dynamic route teller systems and fraud/anomaly detection, we need a subgraph mining algorithm that can mine subgraphs very quickly. However, searching for a subgraph in a graph is an NP-Complete problem and requires extensive enumeration, which is a huge overhead in existing methods, including gSpan. These factors motivated us to design an algorithm that avoids redundant and false pattern checking and reduces costly isomorphism checking, in order to discover all frequent subgraphs in real time. The algorithm proposed in this paper, msiSpan (minimum sequential indexed Subgraph Pattern Mining), attempts to greatly reduce this overhead by indexing subgraphs with custom data structures.

2 Proposed Method

In this section, we present the msiSpan algorithm for mining frequent subgraphs from a set of graph transactions, together with the necessary definitions and an example.

Definition 1

Suppose D is a graph database with N graph transactions, that is, \(D = \{G_1, G_2, G_3, \ldots, G_N\}\). The Minimum DFS Codes and the corresponding DFS subscriptings (the base subscriptings) of all the graph transactions are produced. The roots of all the base subscriptings are connected to a dummy root node. The dummy root node is labeled ‘0’, the whole supergraph is traversed in pre-order, and every node is labeled with its discovery time. The structure formed in this way is the Minimum Supergraph, which represents the whole database.
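The numbering step of Definition 1 can be illustrated with a minimal Python sketch. It assumes each transaction has already been reduced to the DFS tree of its Minimum DFS Code; the `Node` class, the function names and the two toy trees are hypothetical, and backward edges are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                    # vertex label, e.g. "C"
    children: list = field(default_factory=list)  # tree (forward) edges only
    position: int = -1                            # discovery time, set below


def build_minimum_supergraph(dfs_trees):
    """Attach every DFS-tree root to a dummy root and number nodes in pre-order."""
    root = Node(label="root", children=list(dfs_trees))
    next_position = 0
    stack = [root]
    while stack:
        node = stack.pop()
        node.position = next_position  # the dummy root receives 0
        next_position += 1
        # Push children in reverse so that the first child is visited first.
        stack.extend(reversed(node.children))
    return root


def preorder(node):
    """Yield (position, label) pairs in discovery order."""
    yield node.position, node.label
    for child in node.children:
        yield from preorder(child)


# Two hypothetical transactions, already reduced to DFS trees.
g1 = Node("C", [Node("C", [Node("N")]), Node("O")])
g2 = Node("C", [Node("S")])
supergraph = build_minimum_supergraph([g1, g2])
numbering = list(preorder(supergraph))
```

Because positions are assigned in pre-order, the nodes of one transaction occupy a contiguous range of positions, so a position identifies both a node and the transaction it belongs to.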

Definition 2

A compact data structure called the Minimum Sequential Index is proposed to index the occurrences of the rightmost paths of the currently mined frequent subgraphs. For a particular subgraph S, the Minimum Sequential Index of S is denoted as MSI(S) = \(\{d_1, d_2, d_3, \ldots, d_k\}\), where k is the number of transactions in the graph database that contain S. Each \(d_i\) is a list of the occurrences of the rightmost path of S in the Minimum Supergraph for the corresponding graph transaction.
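One possible in-memory layout for the index of Definition 2 is sketched below in Python. The class name, method names and sample positions are assumptions made for illustration and are not taken from the authors' C/C++ implementation. Each list \(d_i\) is keyed by a transaction identifier, and the support of S is the number of non-empty lists.

```python
from collections import defaultdict


class MinimumSequentialIndex:
    """Occurrences of a subgraph's rightmost path, grouped by transaction."""

    def __init__(self):
        # transaction id -> list of rightmost-path occurrences,
        # each occurrence being a tuple of Minimum Supergraph positions
        self._lists = defaultdict(list)

    def add(self, transaction_id, rightmost_path):
        self._lists[transaction_id].append(tuple(rightmost_path))

    def occurrences(self, transaction_id):
        # .get() avoids creating an empty list for an unseen transaction
        return list(self._lists.get(transaction_id, []))

    def support(self):
        # k in MSI(S) = {d_1, ..., d_k}: transactions that contain S
        return len(self._lists)


# Hypothetical occurrences of a one-edge subgraph in two transactions.
msi = MinimumSequentialIndex()
msi.add(1, (1, 2))
msi.add(1, (2, 4))
msi.add(3, (9, 10))
```

Here `msi.support()` is 2, because the subgraph occurs in transactions 1 and 3, even though three occurrences are stored.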

Consider a graph database \(D = \{G_1, G_2, G_3, \ldots, G_N\}\). \(S_{i}\) denotes a subgraph consisting of only one edge, where i is the lexicographic position of the DFS code of that subgraph. The steps of our proposed algorithm are as follows:

  1. Find the Minimum DFS Code of every graph transaction. Infrequent edges are replaced by insignificant edges. Once an edge has been replaced by an insignificant edge, it is not considered in further analysis.

  2. The frequent subgraphs with one edge are identified and arranged in increasing DFS lexicographic order.

  3. The positions of the subgraphs found in step 2 are stored in the Minimum Sequential Index as MSI(\(S_{i}\)). Here, the \(S_{i}\) are the subgraphs containing only one edge and i = 1, 2, 3, ..., n, where n is the number of frequent one-edge subgraphs, sorted in increasing DFS lexicographic order.

  4. The frequent one-edge subgraphs are extended by adding forward edges to every node on the rightmost path, and both backward and forward edges from the rightmost node, in DFS lexicographic order. In our method, the number of extension candidates is significantly reduced with the help of the Minimum Sequential Index and the Minimum Supergraph. From the MSI, the positions of a node on the rightmost path can be retrieved. From the Minimum Supergraph, the nodes adjacent to any node on the rightmost path can be found. A frequent edge is used to extend a node on the rightmost path only if it is adjacent to the corresponding node in the Minimum Supergraph.

  5. If an edge can be added to a node on the rightmost path, then the corresponding occurrence of the rightmost path of the newly formed subgraph is added to its Minimum Sequential Index. The number of lists in the Minimum Sequential Index is the support count of that subgraph. If the support count is greater than or equal to the minimum support threshold, the subgraph is added to the set of frequent subgraphs. Otherwise it is pruned, since no frequent subgraph can be generated by extending it.

    Checking the frequency from the Minimum Sequential Index is the most significant contribution of our work. This technique is efficient because it counts the frequency of a candidate subgraph by extending one edge at a time from the occurrences of its parent subgraph. gSpan, on the other hand, checks for the whole subgraph in all transactions of the projected database, which is an NP-Complete problem.

  6. After all the descendants of a subgraph have been mined, its children are extended in DFS lexicographic order. This recursive process continues, following the extension rules given above, until no more frequent subgraphs can be mined. If a subgraph cannot be extended further, the recursive function backtracks.

  7. After all the frequent subgraphs containing a subgraph \(S_{i}\) have been mined, all the edges labeled \(S_{i}\) in the Minimum Supergraph are replaced by insignificant edges. This shrinks the Minimum Supergraph, resulting in faster processing in the subsequent steps.
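Steps 4 and 5 can be sketched in Python under simplifying assumptions: the parent index is a plain dictionary from transaction id to rightmost-path occurrences, the Minimum Supergraph is an adjacency dictionary over node positions, and only forward extensions from the rightmost node are generated. The function name and all sample data are hypothetical.

```python
from collections import defaultdict


def extend_rightmost_node(parent_msi, adjacency, labels, min_support):
    """Grow each parent occurrence by one forward edge from its rightmost node.

    parent_msi : {transaction id: [tuple of supergraph positions, ...]}
    adjacency  : {position: [(edge label, neighbour position), ...]}
    labels     : {position: vertex label}
    Returns {(edge label, vertex label): child MSI} for frequent children only.
    """
    candidates = defaultdict(lambda: defaultdict(list))
    for tid, occurrences in parent_msi.items():
        for path in occurrences:
            rightmost = path[-1]
            for edge_label, neighbour in adjacency.get(rightmost, []):
                if neighbour in path:
                    continue  # would be a backward edge; not handled here
                key = (edge_label, labels[neighbour])
                # The new node becomes the end of the child's rightmost path.
                candidates[key][tid].append(path + (neighbour,))
    # Support count = number of per-transaction lists in the child MSI.
    return {
        key: dict(child_msi)
        for key, child_msi in candidates.items()
        if len(child_msi) >= min_support
    }


# Hypothetical supergraph fragment covering three transactions.
labels = {1: "C", 2: "C", 3: "N", 4: "C", 5: "C", 6: "N",
          7: "O", 8: "S", 9: "C", 10: "C", 11: "O"}
adjacency = {
    2: [(1, 1), (1, 3)],
    5: [(1, 4), (1, 6), (2, 7), (1, 8)],
    10: [(1, 9), (2, 11)],
}
parent_msi = {1: [(1, 2)], 2: [(4, 5)], 3: [(9, 10)]}
children = extend_rightmost_node(parent_msi, adjacency, labels, min_support=2)
```

In this toy data the extensions by ‘N’ and ‘O’ each occur in two transactions and are kept, while the extension by ‘S’ occurs in only one and is pruned. No transaction is rescanned: each child's support is read directly from the number of lists in its index.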

It is known that two graphs with the same Minimum DFS Code are isomorphic [5]. Thus, when we find a subgraph whose Minimum DFS Code equals that of a previously found subgraph, we prune it, thereby avoiding duplicate subgraphs.
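The pruning rule relies only on having a canonical code that is identical for isomorphic graphs. The Python sketch below shows the idea with a brute-force canonical code (the smallest relabelled representation over all vertex orderings) used as a stand-in for the Minimum DFS Code. It is not the gSpan code itself, its cost is factorial in the number of vertices, and all names are hypothetical.

```python
from itertools import permutations


def canonical_code(vertex_labels, edges):
    """Smallest (labels, edges) representation over all vertex renumberings.

    vertex_labels : list of labels, indexed by vertex id
    edges         : list of (u, v, edge label) for an undirected graph
    """
    n = len(vertex_labels)
    best = None
    for perm in permutations(range(n)):  # perm[old id] = new id
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = vertex_labels[old]
        new_edges = sorted(
            (min(perm[u], perm[v]), max(perm[u], perm[v]), label)
            for u, v, label in edges
        )
        code = (tuple(new_labels), tuple(new_edges))
        if best is None or code < best:
            best = code
    return best


def drop_duplicates(graphs):
    """Keep the first graph seen for each canonical code."""
    seen, unique = set(), []
    for vertex_labels, edges in graphs:
        code = canonical_code(vertex_labels, edges)
        if code not in seen:
            seen.add(code)
            unique.append((vertex_labels, edges))
    return unique


# C-C-N written in two vertex orders, plus the non-isomorphic C-N-C.
g_a = (["C", "C", "N"], [(0, 1, 1), (1, 2, 1)])
g_b = (["N", "C", "C"], [(0, 1, 1), (1, 2, 1)])
g_c = (["C", "N", "C"], [(0, 1, 1), (1, 2, 1)])
unique_graphs = drop_duplicates([g_a, g_b, g_c])
```

Here `g_a` and `g_b` receive the same code, so only one of them survives, while `g_c` is kept as a distinct pattern.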

Fig. 1. A graph database.

Fig. 2. Graph database after labeling.

Fig. 3. Minimum DFS Code with their corresponding DFS Trees.

Fig. 4. Minimum Supergraph after removing the non-frequent edges.

Fig. 5. Frequent subgraphs with 1 edge and their Minimum Sequential Indices (for msiSpan).

Fig. 6. Descendants of \(S_a\) in case of msiSpan.

Let us consider the graph database in Fig. 1, with Minimum Support Threshold \(\sigma \) = 2/3. The edges representing single and double bonds are labelled ‘1’ and ‘2’, respectively, as shown in Fig. 2.
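A relative threshold such as \(\sigma \) = 2/3 translates into an absolute support count once the number of transactions is fixed. The short Python sketch below assumes, for illustration only, that the example database holds three transactions, which the text does not state explicitly; the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def min_support_count(sigma, num_transactions):
    """Smallest number of transactions a frequent subgraph must occur in."""
    # Exact rational arithmetic avoids floating-point rounding, for example
    # a product that should be exactly 2 being rounded up to 3.
    return ceil(Fraction(sigma) * num_transactions)


sigma = Fraction(2, 3)
required = min_support_count(sigma, 3)  # assumed database size of 3
```

Under that assumption a subgraph must occur in at least two of the three transactions to be frequent, which is the test applied to the number of lists in its Minimum Sequential Index.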

The Minimum DFS Codes and the corresponding base subscriptings of the graph transactions are shown in Fig. 3. The Minimum Supergraph, constructed by connecting the roots of the base subscriptings to a common root and replacing the non-frequent edges with insignificant edges, is shown in Fig. 4. The frequent subgraphs with one edge (in DFS lexicographic order), their Minimum DFS Codes and their Minimum Sequential Indices are shown in Fig. 5.

Figure 6 shows the generated descendants of subgraph \(S_a\). From MSI(\(S_a\)), we see that the rightmost node ‘C’ occurs in the Minimum Supergraph at positions 1, 2, 6, 7, 12 and 13. From the Minimum Supergraph it can be seen that these occurrences are connected to nodes labeled ‘N’, ‘S’ and ‘O’. Therefore, only the frequent edges leading to these nodes are used to extend the rightmost node ‘C’. Likewise, the next node ‘C’ on the rightmost path is extended. Finally, we obtain the descendant subgraphs \(S_e\), \(S_f\), \(S_g\), \(S_h\), \(S_i\) and \(S_j\). Since the DFS Codes of \(S_h\), \(S_i\) and \(S_j\) are not the same as their Minimum DFS Codes, they are pruned. The other three are found to be frequent. The frequency checking cost is significantly reduced here, as we only check the occurrences of \(S_a\) recorded in MSI(\(S_a\)) in the Minimum Supergraph. This recursive process continues until all the frequent subgraphs are mined.

3 Experimental Results

A comprehensive performance study was carried out on real-life datasets (Chemical-340, MOLT-4 (Active) and MCF-7 (Active)) obtained from a website (footnote 1). All experiments with our algorithm and gSpan were performed on a 2.40 GHz Intel Pentium Core 2 Duo PC with 2 GB RAM running the Windows 7 operating system. Both algorithms were implemented in C/C++ and run in the same environment on the same machine. Our experiments show that the proposed method is more efficient than the well-known graph mining algorithm gSpan by an order of magnitude.

Figure 7 (left) compares the scalability (dataset size vs time) of our proposed method msiSpan with that of gSpan on the Chemical-340 dataset. The dataset size is varied while the Minimum Support Threshold is kept at 10%. As expected, the required processing time increases with the dataset size. Our proposed method is shown to be more efficient.

Figure 7 (middle) compares the scalability (dataset size vs time) of msiSpan with that of gSpan on the MOLT-4 (Active) dataset. Here, the dataset size is varied at a constant Minimum Support Threshold of 45%. Again, the required processing time increases with the dataset size, as expected. Our proposed method is shown to be more efficient than the existing method.

In Fig. 7 (right), the performance of msiSpan is compared with that of gSpan with respect to scalability (dataset size vs time) on the MCF-7 (Active) dataset. The dataset size is varied at a fixed Minimum Support Threshold of 40%. Again, the required processing time increases with the dataset size. msiSpan is shown to be more efficient than gSpan.

Figure 8 (left) compares the efficiency (Minimum Support Threshold vs time) of our proposed method msiSpan with that of gSpan on the Chemical-340 dataset. The Minimum Support Threshold is varied while the dataset size is held fixed. For both algorithms, the required processing time decreases as the Minimum Support Threshold increases, as expected. msiSpan is shown to be more efficient than gSpan.

Figure 8 (middle) compares the efficiency (Minimum Support Threshold vs time) of msiSpan with that of gSpan on the MOLT-4 (Active) dataset. The Minimum Support Threshold is varied while the dataset size is held fixed. The required processing time decreases as the Minimum Support Threshold increases. msiSpan is shown to be more efficient than the existing method.

In Fig. 8 (right), the performance of msiSpan is compared with that of gSpan with respect to efficiency (Minimum Support Threshold vs time) on the MCF-7 (Active) dataset. The Minimum Support Threshold is varied while the dataset size is held fixed. The required processing time decreases as the Minimum Support Threshold increases. msiSpan is shown to be more efficient than gSpan.

Fig. 7. Database Size vs Time (left: Chemical-340, middle: MOLT-4 (Active), right: MCF-7 (Active)).

Fig. 8. Minimum Support Threshold vs Time (left: Chemical-340, middle: MOLT-4 (Active), right: MCF-7 (Active)).

4 Conclusions

Our research objective was to propose a time-efficient algorithm that mines frequent subgraphs from a graph database, and our proposed algorithm, msiSpan, achieves this goal. We have implemented the algorithm and compared its performance with that of gSpan, the most widely used algorithm for mining frequent subgraphs. We have presented several experimental results analyzing the performance of our algorithm. As future work, we plan to use distributed memory to reduce the memory overhead.