1 Introduction

Pattern mining has wide applications in many areas, such as association rule mining [4, 13, 18, 24], sequence mining [19, 21], and others [3, 14]. Association rule mining discovers relationships among items in a transaction database. An association rule has the form $X \to Y$, where X and Y are itemsets. A class association rule is an association rule whose right-hand side (Y) is a class label.

Class association rule (CAR) mining was first proposed in [12]. Since then, a large number of methods for this problem have been proposed, such as Classification based on Multiple Association Rules (CMAR) [10], Classification based on Predictive Association Rules (CPAR) [27], Multi-class, Multi-label Associative Classification (MMAC) [20], ECR-CARM [23], and CAR-Miner [16]. Classifiers based on association rule mining are often more accurate than those built with ILA or decision trees [9, 12, 22].

All of the above studies focus on CAR mining with batch processing. In reality, datasets change over time through insertion, deletion, and update operations, so algorithms that efficiently mine CARs from incremental datasets are required. The naïve approach is to re-run the CAR mining algorithm on the updated dataset. However, the original dataset is often very large while the updated portion is small, so this approach is inefficient: the entire dataset must be re-scanned and previous mining results cannot be reused. An efficient algorithm for updating the mined CARs when rows are inserted into the original dataset therefore needs to be developed.

This work focuses on solving the problem of CAR mining from an incremental dataset (i.e., new records are added to the original dataset).

The main contributions are as follows:

  1.

    The CAR-Miner algorithm [16] is used to build the MECR-tree for the original dataset.

    The concept of pre-large itemsets (i.e., itemsets that do not satisfy the minimum support threshold but satisfy a lower minimum support threshold) is applied to avoid re-scanning the original dataset [5, 11].

  2.

    When a new dataset is inserted, only the information of the nodes on the MECR-tree (Obidset, count, and pos) needs to be updated. During the update process, nodes that are frequent or pre-large in the original dataset but not frequent in the updated dataset are pruned by processing each node individually. However, this task is time-consuming if many nodes on a given branch of the tree must be removed. A theorem is therefore developed to eliminate such nodes quickly.

The rest of the paper is organized as follows. Section 2 presents basic concepts of CAR mining. Section 3 presents problems related to CAR mining and frequent itemset mining from incremental datasets. Section 4 presents the proposed algorithm while Section 5 provides an example to illustrate its basic ideas. Section 6 shows experimental results on some standard datasets. Conclusions and future work are described in Section 7.

2 Basic concepts

Let D be a training dataset with n attributes $A_1, A_2, \ldots, A_n$ and |D| objects. Let $C = \{c_1, c_2, \ldots, c_k\}$ be the list of class labels in D. An itemset is a set of attribute-value pairs, denoted $\{(A_{i1}, a_{i1}), (A_{i2}, a_{i2}), \ldots, (A_{im}, a_{im})\}$, where $A_{ij}$ is an attribute and $a_{ij}$ is a value of $A_{ij}$.

A class association rule r has the form $\{(A_{i1}, a_{i1}), \ldots, (A_{im}, a_{im})\} \to c$, where $\{(A_{i1}, a_{i1}), \ldots, (A_{im}, a_{im})\}$ is an itemset and $c \in C$ is a class label. The actual occurrence of rule r in D, denoted ActOcc(r), is the number of records in D that match the left-hand side of r. The support of a rule r, denoted Sup(r), is the number of records in D that match r's left-hand side and belong to r's class.

Object Identifier (OID): OID is an object identifier of a record in D.

Example 1

Consider rule r = {(B, b1)} → y for the dataset shown in Table 1. ActOcc(r) = 3 and Sup(r) = 2 because there are three objects with B = b1, of which two belong to class y.
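To make these two measures concrete, the following minimal Python sketch computes them. The three records are a hypothetical fragment of Table 1 (only attribute B and the class label are shown), chosen to be consistent with Example 1; this is an illustration, not the authors' code.

    # ActOcc(r): records matching the left-hand side of the rule.
    # Sup(r): records matching the left-hand side AND the rule's class.
    records = [
        {"B": "b1", "class": "y"},
        {"B": "b1", "class": "y"},
        {"B": "b1", "class": "n"},
    ]

    def act_occ(records, lhs):
        """Number of records matching the rule's left-hand side."""
        return sum(all(r.get(a) == v for a, v in lhs.items()) for r in records)

    def sup(records, lhs, cls):
        """Number of records matching the left-hand side and the class."""
        return sum(all(r.get(a) == v for a, v in lhs.items()) and r["class"] == cls
                   for r in records)

    print(act_occ(records, {"B": "b1"}))    # 3
    print(sup(records, {"B": "b1"}, "y"))   # 2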

Table 1 Example of a training dataset

3 Related works

3.1 Mining class association rules

This section introduces existing algorithms for mining CARs from static datasets, namely CBA [12], CMAR [10], ECR-CARM [23], and CAR-Miner [16].

Table 2 Summary of existing algorithms for mining CARs

The first study of CAR mining was presented in [12]. The authors proposed CBA-RG, an Apriori-like algorithm, for mining CARs, together with CBA-CB, an algorithm that builds a classifier from the mined CARs by heuristically selecting the strongest rules. Li, Han, and Pei proposed a method called CMAR in 2001 [10]. CMAR uses the FP-tree to mine CARs and the CR-tree to store the resulting rules. CMAR predicts with multiple rules: to classify a record with an unlabeled class, it collects the set of rules R that match the record, divides R into l groups corresponding to the l classes present in R, computes a weighted χ² for each group, and assigns the class of the group with the highest weighted χ². The MMAC method [20] uses multiple labels for each rule and multiple classes for prediction. Antonie and Zaïane proposed an approach that uses both positive and negative rules to predict the classes of new samples [1]. Vo and Le [23] presented the ECR-CARM algorithm for quickly mining CARs. CAR-Miner, an improved version of ECR-CARM proposed by Nguyen et al. in 2013 [16], significantly reduces execution time compared to ECR-CARM. Nguyen et al. [11] proposed a parallel algorithm for fast mining of CARs. Several methods for pruning and sorting rules have also been proposed [10, 12, 15, 20, 23, 27].

These approaches are designed for batch processing only; i.e., they must be executed on the integration of the original and inserted datasets. In reality, datasets often change through the addition of new records, the deletion of old records, or the modification of existing records. Re-mining the updated dataset without reusing previously mined knowledge is time-consuming, especially when the original dataset is large. Mining rules from frequently changing datasets is thus a challenging problem.

3.2 Mining association rules from incremental datasets

One of the most frequent changes to a dataset is data insertion. Integrating the original and inserted datasets and mining CARs from scratch incurs high execution time and storage costs, so updating the knowledge already mined from the original dataset is an important issue. This section reviews methods for frequent itemset mining from incremental datasets.

Cheung et al. [2] proposed the FUP (Fast UPdate) algorithm, which is based on Apriori and DHP, to find frequent itemsets. The authors categorized an itemset as frequent or infrequent in each of the original and inserted datasets. There are thus four cases to consider, as shown in Table 3.

Table 3 Four cases of an itemset in the original and inserted datasets [2]

In cases 1 and 4, the original dataset does not need to be examined to determine whether an itemset is frequent or infrequent in the updated dataset. In case 2, only the new support count of the itemset in the inserted dataset is needed. In case 3, the original dataset must be re-scanned to determine whether the itemset becomes frequent, since the supports of infrequent itemsets are not stored.

Although FUP primarily uses the inserted data, the original dataset must still be re-scanned in case 3, which requires considerable effort and time for large original datasets. In addition, FUP uses both frequent and infrequent itemsets of the inserted data, so popular frequent itemset mining algorithms are difficult to apply directly, and a large number of itemsets must be mined for comparison with the frequent itemsets previously mined from the original dataset. To minimize both the number of re-scans of the original dataset and the number of itemsets generated from the new data, Hong et al. [5] proposed the concept of pre-large itemsets. A pre-large itemset is an infrequent itemset whose support is larger than or equal to a lower support threshold. The concept uses two minimum support thresholds: the upper minimum support $S_U$ (which is also the usual minimum support threshold) and the lower minimum support $S_L$. With these two thresholds, an itemset falls into one of three categories: frequent, pre-large, or infrequent. There are thus nine cases for an itemset across the two datasets (original and inserted), as shown in Table 4.

Table 4 Nine cases of an itemset in the original and inserted datasets when using the pre-large concept
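The three-way categorization itself is simple to state. A minimal sketch, assuming supports are expressed as fractions of the dataset size (the example values use the thresholds of Section 5, $S_U$ = 25 % and $S_L$ = 12.5 %):

    # Classify an itemset's support against the upper/lower thresholds.
    def categorize(support, s_u, s_l):
        if support >= s_u:
            return "frequent"
        if support >= s_l:
            return "pre-large"
        return "infrequent"

    print(categorize(2 / 8, 0.25, 0.125))   # frequent
    print(categorize(1 / 8, 0.25, 0.125))   # pre-large
    print(categorize(0 / 8, 0.25, 0.125))   # infrequent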

To reduce the number of re-scans of the original dataset, the authors proposed the following safety threshold f (i.e., if the number of added records does not exceed this threshold, the original dataset does not need to be re-scanned):

$$ f=\left\lfloor \frac{\left( S_{U}-S_{L} \right)\times \mid D \mid}{1-S_{U}} \right\rfloor $$
(1)

where ∣D∣ is the number of records in the original dataset.
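As a small sketch of formula (1), using the worked values that appear later in Section 5 ($S_U$ = 25 %, $S_L$ = 12.5 %, |D| = 8):

    from math import floor

    # Safety threshold f: the maximum number of inserted records for which
    # the original dataset need not be re-scanned (formula (1)).
    def safety_threshold(s_u, s_l, n_records):
        return floor((s_u - s_l) * n_records / (1 - s_u))

    print(safety_threshold(0.25, 0.125, 8))   # 1, as in Section 5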

In 2009, Lin et al. proposed the Pre-FUFP algorithm, which combines the FP-tree with the pre-large concept to mine frequent itemsets [11]. Their algorithm updates the FP-tree when a new dataset is inserted, using the safety threshold f. After the FP-tree is updated, FP-Growth is applied to mine frequent itemsets from the whole FP-tree (built from the original and inserted data). Because the updated FP-tree covers the entire resulting dataset, this method does not reuse previously mined frequent itemsets and must re-mine them from the FP-tree. Further effective pre-large-based methods for mining itemsets from incremental datasets have been proposed, including methods based on the Trie [7] and the IT-tree [8], a method for fast updating of the frequent itemset lattice [25], and methods for fast updating of the frequent closed itemset lattice [6, 26]. A summary of algorithms for incremental mining is shown in Table 5.

Table 5 Summary of algorithms for incremental mining

4 A novel method for updating class association rules in incremental datasets

The algorithms presented in Section 3.2 are designed for mining frequent itemsets. It is difficult to adapt them to the CAR mining problem because they address only the first phase of association rule mining (updating frequent itemsets when new transactions are inserted into the dataset), while the second phase still generates association rules from all frequent itemsets.

Unlike association rule mining, CAR mining has a single phase that calculates node information, in which each node can generate at most one rule whose support and confidence satisfy the thresholds. Updating the information of each node (Obidset, count, and pos) is much more complex than updating an itemset (which involves only the support). This section presents a method that mines class association rules based on the concept of pre-large itemsets. The proposed method uses CAR-Miner, with a few modifications, to build the MECR-tree on the original dataset; the modified algorithm is called Modified-CAR-Miner (Fig. 1). The MECR-tree is generated with the lower minimum support threshold, and the safety threshold f is then calculated using (1). A tree-traversal function generates the rules that satisfy the upper minimum support threshold. When a new dataset is inserted, its number of rows is compared with the safety threshold f: if it does not exceed f, the tree is updated in place by changing the information of its nodes; otherwise, Modified-CAR-Miner is called to rebuild the entire tree from the original and inserted datasets. A theorem for pruning tree nodes is also developed to reduce the execution time and the storage space of nodes.

Fig. 1
figure 1

Modified CAR-Miner algorithm for incremental mining

4.1 Modified CAR-Miner algorithm for incremental mining

a) MECR-tree structure

Each node in the MECR-tree represents an itemset (att, values) and stores the following information [16]:

  i.

    Obidset: a set of object identifiers containing the itemset

  ii.

    ($\#c_1, \#c_2, \ldots, \#c_k$): a set of integers where $\#c_i$ is the number of objects (among those in Obidset) that belong to class $c_i$

  iii.

    pos: an integer storing the position of the class with the maximum count, i.e., $pos = \arg\max_{i \in [1,k]}\{\#c_i\}$; in the figures, this position is underlined.

More details about the MECR-tree can be found in Nguyen et al. [16].
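For illustration, a possible in-memory layout of such a node is sketched below. The field names follow i)-iii) above; reading att as an attribute bitmask (e.g., node (3, a1b1) in Section 5 joins attributes 1 and 2) and the marked flag (used by the update procedure of Section 4.2) are our assumptions, not the authors' implementation.

    from dataclasses import dataclass, field
    from typing import List, Set

    @dataclass
    class MECRNode:
        att: int                                          # attribute bitmask (assumed)
        values: str                                       # e.g. "a1b1"
        obidset: Set[int] = field(default_factory=set)    # OIDs containing the itemset
        count: List[int] = field(default_factory=list)    # (#c1, ..., #ck)
        children: List["MECRNode"] = field(default_factory=list)
        marked: bool = False                              # changed by inserted rows

        @property
        def pos(self) -> int:
            """Position of the class with the maximum count."""
            return max(range(len(self.count)), key=lambda i: self.count[i])

        @property
        def sup(self) -> int:
            """Support of the rule itemset -> c_pos, i.e. count[pos]."""
            return self.count[self.pos]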

b) Modified CAR-Miner algorithm

Input: Original dataset D, two minimum support thresholds $S_U$ and $S_L$, and minimum confidence threshold minConf

Output: Class association rules mined from D that satisfy $S_U$ and minConf

Figure 1 shows the modified version of CAR-Miner for incremental mining. The main differences from CAR-Miner are on lines 3, 19, and 22. When the procedure GENERATE-CAR is called to generate a rule (line 3), its input must be $S_U$. Lines 22 and 24 therefore check whether the support and confidence of the current node satisfy $S_U$ and minConf, respectively; if both conditions hold, a rule is generated (line 25). Line 19 checks whether the support of a new node satisfies $S_L$; if so, the node is added to the tree.
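The two-threshold behavior can be summarized by the following sketch (a paraphrase of lines 19 and 22-25 assuming the MECRNode layout sketched above, not the authors' code). Note that ActOcc equals the sum of the per-class counts, since each matching object belongs to exactly one class:

    def keep_node(node, s_l, n_records):
        """Line 19: keep a node in the tree if it is at least pre-large."""
        return node.sup >= s_l * n_records

    def generate_car(node, s_u, min_conf, n_records, cars):
        """Lines 22-25: emit a rule only if it satisfies S_U and minConf."""
        support = node.sup
        confidence = support / sum(node.count)   # ActOcc = sum of class counts
        if support >= s_u * n_records and confidence >= min_conf:
            cars.append((node.values, node.pos, support, confidence))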

4.2 Algorithm for updating the MECR-tree in incremental datasets

Theorem 1

Given two nodes $l_1$ and $l_2$ in the MECR-tree, if $l_1$ is a parent node of $l_2$ and $Sup(l_1.itemset \to c_{l_1.pos}) < minSup$, then $Sup(l_2.itemset \to c_{l_2.pos}) < minSup$.

Proof

Because $l_1$ is a parent node of $l_2$, we have $l_1.itemset \subset l_2.itemset$, so every object containing $l_2.itemset$ also contains $l_1.itemset$, i.e., $l_1.Obidset \supseteq l_2.Obidset$. Hence $\forall i$, $l_1.count_i \ge l_2.count_i$, and thus $\max_{i=1}^{k}\{l_1.count_i\} \ge \max_{i=1}^{k}\{l_2.count_i\}$, which gives $Sup(l_1.itemset \to c_{l_1.pos}) \ge Sup(l_2.itemset \to c_{l_2.pos})$. Since $Sup(l_1.itemset \to c_{l_1.pos}) < minSup$, it follows that $Sup(l_2.itemset \to c_{l_2.pos}) < minSup$. □

Based on Theorem 1, when a node is found to be infrequent, it can be pruned from the MECR-tree together with its entire subtree, which reduces the update time.

Input: The MECR-tree built from the original dataset D, with root node $L_r$; the inserted dataset D'; two minimum support thresholds $S_U$ and $S_L$; and minConf

Output: Class association rules from D + D' that satisfy $S_U$ and minConf

Figure 2 shows the algorithm for updating the MECR-tree when dataset D' is inserted. First, the algorithm checks whether the MECR-tree has been created by examining the number of rows in the original dataset. If the number of rows is 0 (line 1), the tree does not yet exist, so Modified-CAR-Miner is called to create the MECR-tree for dataset D' (line 2) and the safety threshold f is computed using (1) (line 3). If the number of rows in D' is larger than the safety threshold f, the algorithm calls Modified-CAR-Miner to generate rules on the entire dataset D + D' (lines 4 and 5) and then recomputes the safety threshold f on the integrated dataset D + D' (line 6). Otherwise, the algorithm simply updates the MECR-tree as follows. First, the Obidsets of all nodes on the tree are deleted (line 8) to ensure that the algorithm works on the inserted dataset only. Second, the UPDATE-TREE procedure is called on the root node $L_r$ to update the information of the nodes (line 9). Third, GENERATE-RULES is called on $L_r$ to generate the rules whose supports and confidences satisfy $S_U$ and minConf (line 10). The safety threshold f is then reduced to f − |D'| (line 11). Finally, the original dataset is supplemented with D' (line 12).
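Under the same assumptions as the earlier sketches, the top-level flow of Fig. 2 might look as follows. Here modified_car_miner stands for the algorithm of Fig. 1 (not reproduced), safety_threshold is the helper sketched in Section 3.2, and update_tree and generate_rules are sketched later in this section; all names are placeholders, not an actual API.

    def clear_obidsets(node):
        # Line 8: drop all Obidsets so that counting proceeds on D' only.
        node.obidset = set()
        for child in node.children:
            clear_obidsets(child)

    def insert_and_mine(state, d_new, s_u, s_l, min_conf):
        if len(state.data) == 0:                      # lines 1-3: tree not built yet
            state.root = modified_car_miner(d_new, s_u, s_l, min_conf)
            state.f = safety_threshold(s_u, s_l, len(d_new))
            state.data = list(d_new)
            return
        if len(d_new) > state.f:                      # lines 4-6: rebuild on D + D'
            state.data = state.data + list(d_new)
            state.root = modified_car_miner(state.data, s_u, s_l, min_conf)
            state.f = safety_threshold(s_u, s_l, len(state.data))
            return
        n_total = len(state.data) + len(d_new)        # lines 8-12: in-place update
        clear_obidsets(state.root)
        update_tree(state.root, d_new, s_l, n_total)  # line 9
        cars = []
        generate_rules(state.root, s_u, min_conf, n_total, cars)  # line 10
        state.f -= len(d_new)                         # line 11
        state.data = state.data + list(d_new)         # line 12
        return cars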

Fig. 2
figure 2

Algorithm for updating the MECR-tree for an incremental dataset

Consider procedure UPDATE-TREE. First, it updates the information of the nodes in the first level of the MECR-tree whose itemsets appear in the inserted dataset and marks them (line 13). Line 15 checks whether a child node $l_i$ of the root $L_r$ is unmarked (i.e., unchanged by the inserted rows); if so, its subtree is checked using Theorem 1 (line 16). If the support of $l_i$ does not satisfy $S_L$, then $l_i$ and all its child nodes are deleted by Theorem 1 (lines 17 and 18). Otherwise ($l_i$ is marked and its support satisfies $S_L$), the Obidset, count, and pos of all child nodes of $l_i$ are updated (lines 19-27). If the support of a resulting node O satisfies $S_L$, it is marked (lines 28 and 29). After all child nodes of $l_i$ have been processed, UPDATE-TREE is called recursively on $l_i$ (line 30).
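A sketch of this control flow, under the same assumptions as before: the join and count update of lines 19-29 is abstracted into a placeholder update_child_joins, and traverse_tree_to_check is sketched after the next paragraph.

    def update_tree(root, d_new, s_l, n_total):
        min_count = s_l * n_total
        for child in list(root.children):            # copy: we delete while iterating
            if not child.marked:                     # lines 15-16: unchanged node
                traverse_tree_to_check(child, min_count, root)
                continue
            if child.sup < min_count:                # lines 17-18: below S_L
                root.children.remove(child)          # Theorem 1: subtree goes too
                continue
            update_child_joins(child, min_count)     # lines 19-29 (placeholder)
            update_tree(child, d_new, s_l, n_total)  # line 30: recurse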

Consider procedure TRAVERSE-TREE-TO-CHECK. It checks whether the support of l satisfies $S_L$. If not, l and all its child nodes are deleted using Theorem 1; otherwise, the child nodes of l are checked in the same way, recursively, until no nodes remain. Procedure DELETE-TREE deletes a node together with all its child nodes. Procedure GENERATE-RULES visits each child node l of the root $L_r$ to generate a rule r: if the support of r satisfies minSup (i.e., $S_U$; line 40), the confidence of r is checked (line 42), and if it satisfies minConf, r is added to the set of rules (CARs). The procedure is then called recursively to generate all rules from the subtree rooted at l.
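These two procedures can be sketched as follows (same assumptions; in this sketch, removing a child reference drops its whole subtree, which plays the role of DELETE-TREE):

    def traverse_tree_to_check(node, min_count, parent):
        if node.sup < min_count:                      # Theorem 1: prune subtree
            parent.children.remove(node)              # implicit DELETE-TREE
            return
        for child in list(node.children):
            traverse_tree_to_check(child, min_count, node)

    def generate_rules(root, s_u, min_conf, n_total, cars):
        for child in root.children:
            support = child.sup                       # line 40: check support
            confidence = support / sum(child.count)   # line 42: ActOcc = sum(count)
            if support >= s_u * n_total and confidence >= min_conf:
                cars.append((child.values, child.pos, support, confidence))
            generate_rules(child, s_u, min_conf, n_total, cars)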

5 Example

Assume that the dataset in Table 1 is the original dataset and the inserted dataset has one row, as shown in Table 6.

Table 6 Inserted dataset

With $S_U$ = 25 % and $S_L$ = 12.5 %, the process of creating and updating the MECR-tree is illustrated as follows.

Figure 3 shows the results of Modified-CAR-Miner obtained using $S_L$ for the dataset in Table 1. Because 25 % × 8 = 2 and 12.5 % × 8 = 1, nodes whose supports are at least 2 are frequent and nodes whose supports equal 1 are pre-large. Consequently, the nodes enclosed by the dashed line contain pre-large itemsets.

Fig. 3
figure 3

Results of Modified-CAR-Miner with $S_U$ = 25 % and $S_L$ = 12.5 %

The safety threshold f is computed as follows: $f = \left\lfloor \frac{(0.25 - 0.125) \times 8}{1 - 0.25} \right\rfloor = 1$.

Consider the inserted dataset. Because it contains one row, |D'| = 1 ≤ f = 1, so the algorithm updates the information of the nodes in the tree without re-scanning the original dataset D.

The process of updating the MECR-tree is as follows. The first level of the MECR-tree is updated. The results are shown in Fig. 4.

Fig. 4
figure 4

Results of updating level 1 of the MECR-tree

The new row (row 9) contains items (A, a1), (B, b1), and (C, c2), so only three nodes in the MECR-tree are changed (marked by T in Fig. 4).

  • Consider node $l_i = \langle(1, a1), \{9\}, (\underline{4}, 2)\rangle$, where each node is written as $\langle(att, values), Obidset, (\#c_1, \#c_2)\rangle$ and the count at position pos is underlined. Because this node has changed, it must be checked against the changed nodes that follow it in the same level to update information:

    • With node $l_j = \langle(2, b1), \{9\}, (\underline{3}, 1)\rangle$, the node created from these two nodes (after the update) is $\langle(3, a1b1), \{9\}, (\underline{2}, 1)\rangle$. This node has count[pos] = 2 ≥ $S_L$ × (8 + 1) = 1.125, so it is marked as a changed node.

    • With node $l_j = \langle(4, c2), \{9\}, (1, \underline{3})\rangle$, the node created from these two nodes (after the update) is $\langle(5, a1c2), \{9\}, (1, \underline{2})\rangle$. This node has count[pos] = 2 ≥ $S_L$ × (8 + 1) = 1.125, so it is marked as a changed node.

      The results obtained after considering node $\langle(1, a1), \{9\}, (\underline{4}, 2)\rangle$ are shown in Fig. 5.

      Fig. 5
      figure 5

      Results obtained after considering node $\langle(1, a1), \{9\}, (\underline{4}, 2)\rangle$

      After considering node $\langle(1, a1), \{9\}, (\underline{4}, 2)\rangle$, the algorithm is called recursively to update all of its child nodes.

    • Consider node $\langle(3, a1b1), \{9\}, (\underline{2}, 1)\rangle$. Because count[pos] ≥ $S_L$ × (8 + 1), it is checked against node $\langle(5, a1c2), \{9\}, (1, \underline{2})\rangle$. The node created from these two nodes (after the update) is $\langle(7, a1b1c2), \{9\}, (\underline{1}, 1)\rangle$. This node is deleted because count[pos] = 1 < $S_L$ × (8 + 1), and all its child nodes are deleted because their supports are also smaller than $S_L$ × (8 + 1).

    • Consider node $\langle(3, a1b2), \emptyset, (0, \underline{1})\rangle$. Because count[pos] = 1 < $S_L$ × (8 + 1), this node is deleted, and all its child nodes are also deleted using Theorem 1.

    • Similarly, node $\langle(5, a1c1), \emptyset, (\underline{1}, 0)\rangle$ is also deleted.

  • Consider node $\langle(1, a2), \emptyset, (1, \underline{2})\rangle$. This node is not deleted, but all its child nodes are deleted after checking their supports and applying Theorem 1.

The same process is applied to nodes $\langle(2, b1), \{9\}, (\underline{3}, 1)\rangle$, $\langle(2, b2), \emptyset, (0, \underline{2})\rangle$, $\langle(2, b3), \emptyset, (\underline{2}, 1)\rangle$, $\langle(4, c1), \emptyset, (\underline{1}, 1)\rangle$, $\langle(4, c2), \{9\}, (1, \underline{3})\rangle$, and $\langle(4, c3), \emptyset, (\underline{3}, 0)\rangle$; the MECR-tree after all updates is shown in Fig. 6.

Fig. 6
figure 6

Updated MECR-tree

The number of nodes in Fig. 6 is significantly less than that in Fig. 3 (14 versus 33). The MECR-tree can thus be efficiently updated.

Note that after the MECR-tree is updated, the safety threshold f is decreased by 1, giving f = 0. This means that if another dataset is inserted, the algorithm must rebuild the MECR-tree from the original and inserted datasets; D = D + D' now includes nine rows.

6 Experimental results

Experiments were conducted on a computer with an Intel Core i3 2.53-GHz CPU and 2 GB of RAM running Windows 7. Algorithms were coded in C# 2010.

Experimental datasets were obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). Table 7 shows the characteristics of the experimental datasets.

Table 7 Characteristics of experimental datasets

The experimental results in Figs. 7-20 show that CAR-Incre is more efficient than CAR-Miner in most cases, especially on large datasets or with a large minSup; examples are Poker-hand (a large number of records) and Chess (a large minSup).

Fig. 7
figure 7

Run times for CAR-Miner and CAR-Incre for the Breast dataset ($S_U$ = 1 %, $S_L$ = 0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %) for each inserted dataset (two records per insert)

Fig. 8
figure 8

Total run time for CAR-Miner and CAR-Incre for the Breast dataset ($S_U$ = 1 %, $S_L$ = 0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %)

Fig. 9
figure 9

Run times for CAR-Miner and CAR-Incre for the German dataset ($S_U$ = 3 %, $S_L$ = 2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %) for each inserted dataset (two records per insert)

Fig. 10
figure 10

Total run time for CAR-Miner and CAR-Incre for the German dataset ($S_U$ = 3 %, $S_L$ = 2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %)

Fig. 11
figure 11

Run times for CAR-Miner and CAR-Incre for the Lymph dataset ($S_U$ = 3 %, $S_L$ = 2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %) for each inserted dataset (one record per insert)

Fig. 12
figure 12

Total run time for CAR-Miner and CAR-Incre for the Lymph dataset ($S_U$ = 3 %, $S_L$ = 2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %)

Fig. 13
figure 13

Run times for CAR-Miner and CAR-Incre for the Led7 dataset ($S_U$ = 1 %, $S_L$ = 0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %) for each inserted dataset (two records per insert)

Fig. 14
figure 14

Total run time for CAR-Miner and CAR-Incre for the Led7 dataset ($S_U$ = 1 %, $S_L$ = 0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %)

Fig. 15
figure 15

Run times for CAR-Miner and CAR-Incre for the Vehicle dataset ($S_U$ = 1 %, $S_L$ = 0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %) for each inserted dataset (two records per insert)

Fig. 16
figure 16

Total run time for CAR-Miner and CAR-Incre for the Vehicle dataset ($S_U$ = 1 %, $S_L$ = 0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %)

Fig. 17
figure 17

Run times for CAR-Miner and CAR-Incre for the Chess dataset ($S_U$ = 60 %, $S_L$ = 59 %, 58 %, 57 %, 56 %, 55 %) for each inserted dataset (eight records per insert)

Fig. 18
figure 18

Total run time for CAR-Miner and CAR-Incre for the Chess dataset ($S_U$ = 60 %, $S_L$ = 59 %, 58 %, 57 %, 56 %, 55 %)

Fig. 19
figure 19

Run times for CAR-Miner and CAR-Incre for the Poker-hand dataset ($S_U$ = 3 %, $S_L$ = 2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %) for each inserted dataset (2,000 records per insert)

Fig. 20
figure 20

Total run time for CAR-Miner and CAR-Incre for the Poker-hand dataset ($S_U$ = 3 %, $S_L$ = 2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %)

6.1 The impact of the number of records

CAR-Incre is very efficient when the number of records in the original dataset is large. For example, consider the Poker-hand dataset: updating the MECR-tree built from 980,000 original rows with 2,000 new rows takes about 0.09 seconds, whereas batch mining on all 982,000 rows takes about 34 seconds (Fig. 19). Figure 20 compares the total run times of the two algorithms (CAR-Miner and CAR-Incre) for several $S_L$ values ($S_U$ = 3 %).

However, comparing the run times of the two algorithms on a small dataset such as Lymph shows that CAR-Miner is faster than CAR-Incre at all thresholds.

6.2 The impact of $S_L$

The most important issue in CAR-Incre is choosing a suitable $S_L$. If $S_L$ is large, then f is small; in this case, the algorithm must re-scan the original dataset many times, which is time-consuming. If $S_L$ is small, many frequent and pre-large itemsets are generated and must be maintained in the tree, which is also very time-consuming. To the best of our knowledge, there is no method for choosing a suitable $S_L$ value. We therefore conducted experiments with different $S_L$ values to determine their influence on the run time.

Consider the Breast dataset with $S_U$ = 1 %. The total time of 7 runs of CAR-Miner is 0.369 s. With $S_L$ = {0.9 %, 0.8 %, 0.7 %, 0.6 %, 0.5 %}, the run times of CAR-Incre are {0.344, 0.390, 0.219, 0.202, 0.234} s, respectively; the best threshold is 0.6 %.

Consider the German dataset with $S_U$ = 3 %. The total time of 10 runs of CAR-Miner is 6.5 s. With $S_L$ = {2.8 %, 2.6 %, 2.4 %, 2.2 %, 2.0 %}, the run times of CAR-Incre are {6.147, 4.774, 4.274, 4.898, 4.368} s, respectively; the best threshold is 2.4 %.

Similarly, the best threshold for Led7 is 0.5 % ($S_U$ = 1 %), for Vehicle 0.7 % ($S_U$ = 1 %), for Chess 59 % ($S_U$ = 60 %), and for Poker-hand 2.0 % ($S_U$ = 3 %).

6.3 The impact of minSup

The safety threshold f grows with minSup ($S_U$): if minSup is large, the safety threshold f is also large. Consequently, when the number of inserted records is small, the original dataset does not need to be re-scanned; only the information of the nodes on the tree is updated with the new data. For example, consider the Chess dataset with $S_U$ = 60 % and $S_L$ = 59 %. The original dataset is extended eight times with eight rows each time, yet the safety threshold is never exceeded ($f = \lfloor (0.6 - 0.59) \times 3132 / (1 - 0.6) \rfloor = 78$ records).

7 Conclusions and future work

This paper proposed a method for mining CARs from incremental datasets. The proposed method has several advantages:

  • The MECR-tree structure is used to generate rules quickly.

  • The concept of pre-large itemsets is applied to CAR mining to reduce the number of re-scans on the original dataset.

  • A theorem for quickly pruning infrequent nodes in the tree is developed to improve the process of updating the tree.

One weakness of the proposed method is that it must rebuild the MECR-tree from the original and inserted datasets when the number of rows in the inserted dataset exceeds the safety threshold f. This is not appropriate for large original datasets, so the algorithm is being improved to avoid re-scanning the original dataset. In addition, a lattice structure helps identify redundant rules quickly; updating the lattice when a dataset is inserted will thus be studied in future work.