Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support
Introduction
Association rule mining (Agrawal, Imielinski, & Swami, 1993) plays a key role in the research, development and application of data mining techniques. It has led to many significant techniques and methodologies for identifying association rules, most of which address efficiency by scaling up algorithms and reducing data.
However, most of these mining algorithms assume that users can specify a minimum support appropriate to their databases, and are therefore referred to as Apriori-like algorithms (Zhang & Zhang, 2002; Zhang et al., 2004). Han, Wang, Lu, and Tzvetkov (2002) have pointed out that setting the minimum support is quite subtle, which can hinder the widespread application of these algorithms. Our own experience of mining transaction databases also tells us that choosing this setting is by no means an easy task.
In response to this limitation, many techniques have been developed to address the issue. Han et al. (2002) designed a strategy to mine top-k frequent closed patterns for effectiveness and efficiency. Piatetsky-Shapiro and Steingold (2000) proposed a method to identify only the top 10% or 20% of the prospects with the highest score for marketing. Roddick and Rice (2001) presented independent thresholds and context-dependent thresholds to measure the time-varying interestingness of events for temporal data. Hipp and Guntzer (2002) explored a new mining approach that postpones constraints from mining to evaluation. Wang, He, Cheung, and Chin (2001) designed a confidence-driven mining strategy without minimum support to identify new patterns. Cheung and Fu (2004) developed a technique to identify frequent itemsets without the support threshold. Zhang et al. (2004) advocated a fuzzy-logic-based method to acquire the user's threshold of minimum support for mining association rules. However, most of these approaches attempt to avoid specifying the minimum support rather than determine it. Some of them are actually confidence-driven methods, and the last approach solves the minimum-support issue by a coding technique. All of these efforts provide good insight into the difficulty of specifying a minimum-support constraint.
In this paper we use a genetic algorithm to identify association rules without minimum support. A genetic algorithm is efficient for global search, especially when the search space is too large for a deterministic search method. It imitates the mechanics of natural evolution using principles from genetics, such as natural selection, crossover, and mutation. In particular, our approach does not require users to specify the minimum-support threshold. Instead of generating an unknown number of interesting rules, as traditional mining models do, it returns only the most interesting rules according to the interestingness measure defined by the fitness function. This method is therefore database-independent, in contrast to Apriori-like algorithms. The approach leads to (1) effectiveness and efficiency for global search; and (2) system automation, because our model does not require a user-specified threshold of minimum support.
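The idea of returning only the fittest rules, with no minimum-support threshold, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the ARMGA algorithm of Section 3: the chromosome layout (a split position followed by k distinct items), the leverage-style fitness supp(X ∪ Y) − supp(X)·supp(Y), the truncation selection, the operators, the toy `DATABASE`, and all function names are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List, Tuple

Transaction = FrozenSet[str]
# Hypothetical encoding: (split position, k distinct items).
# Items before the split form the antecedent X, the rest the consequent Y.
Chromosome = Tuple[int, Tuple[str, ...]]
Rule = Tuple[FrozenSet[str], FrozenSet[str]]

def support(items, db: List[Transaction]) -> float:
    """Fraction of transactions containing every item in `items`."""
    s = frozenset(items)
    return sum(1 for t in db if s <= t) / len(db)

def decode(ch: Chromosome) -> Rule:
    split, items = ch
    return frozenset(items[:split]), frozenset(items[split:])

def fitness(ch: Chromosome, db: List[Transaction]) -> float:
    """Assumed interestingness measure (leverage), not the paper's fitness."""
    x, y = decode(ch)
    return support(x | y, db) - support(x, db) * support(y, db)

def random_chromosome(universe: List[str], k: int, rng: random.Random) -> Chromosome:
    return rng.randint(1, k - 1), tuple(rng.sample(universe, k))

def crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    """Child draws k distinct items from the union of both parents' items."""
    pool = sorted(set(a[1]) | set(b[1]))
    return rng.choice((a[0], b[0])), tuple(rng.sample(pool, len(a[1])))

def mutate(ch: Chromosome, universe: List[str], rng: random.Random, rate: float) -> Chromosome:
    split, items = ch[0], list(ch[1])
    if rng.random() < rate:  # replace one item with an unused one
        unused = [i for i in universe if i not in items]
        if unused:
            items[rng.randrange(len(items))] = rng.choice(unused)
    if rng.random() < rate:  # move the antecedent/consequent boundary
        split = rng.randint(1, len(items) - 1)
    return split, tuple(items)

def mine_rules(db: List[Transaction], k: int = 2, pop_size: int = 20,
               generations: int = 30, top_n: int = 3, rate: float = 0.2,
               seed: int = 0) -> List[Tuple[Rule, float]]:
    """Return the top_n fittest k-rules; no minimum support is supplied."""
    rng = random.Random(seed)
    universe = sorted(set().union(*db))
    population = [random_chromosome(universe, k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged.
        population.sort(key=lambda c: fitness(c, db), reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   universe, rng, rate)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    scored: Dict[Rule, float] = {decode(c): fitness(c, db) for c in population}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy transaction database, invented for this example.
DATABASE = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "milk"}),
]

if __name__ == "__main__":
    for (x, y), score in mine_rules(DATABASE):
        print(sorted(x), "=>", sorted(y), round(score, 3))
```

The caller chooses how many rules to receive (`top_n`) rather than a support cut-off, which is the design point the paragraph above describes; a different fitness function would rank the rules differently.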
The rest of this paper is organized as follows. Section 2 briefly recalls the concepts of association rules and current work on genetic algorithm-based learning. In Section 3, we present our genetic algorithm-based model for identifying association rules, including the encoding method, genetic operators, and the ARMGA algorithm. In Section 4, we expand the ARMGA algorithm to identify generalized association rules. In Section 5, we experimentally evaluate our approach. Finally, we conclude our work in Section 6.
Preliminaries
This section recalls some concepts concerning association rule mining, quantitative association rules, and genetic algorithms.
Identifying association rules with genetic algorithm
Let I be the universal set of items. Then a transaction can be viewed as an itemset with variable length, and a database D can be defined as a set of transactions over I. An association rule X ⇒ Y is a k-rule if X ∪ Y is a k-itemset.
From Section 2.1, we know that the traditional task of mining association rules is to find all rules X ⇒ Y such that the supports and confidences of the rules are larger than, or equal to, the minimum support, minsupp, and the minimum confidence, minconf,
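The traditional task stated above can be made concrete with a short sketch. It assumes the standard definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule is the support of the antecedent and consequent together divided by the support of the antecedent. The function names and the toy `TRANSACTIONS` data are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

def support(itemset: Iterable[str], database: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not database:
        return 0.0
    return sum(1 for t in database if items <= t) / len(database)

def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               database: List[Transaction]) -> float:
    """supp(X ∪ Y) / supp(X); defined as 0.0 when X never occurs."""
    x = frozenset(antecedent)
    supp_x = support(x, database)
    if supp_x == 0.0:
        return 0.0
    return support(x | frozenset(consequent), database) / supp_x

def is_valid_rule(antecedent: Iterable[str], consequent: Iterable[str],
                  database: List[Transaction],
                  minsupp: float, minconf: float) -> bool:
    """Traditional test: both user-supplied thresholds must be met."""
    x, y = frozenset(antecedent), frozenset(consequent)
    return (support(x | y, database) >= minsupp
            and confidence(x, y, database) >= minconf)

# Toy data, invented for this example.
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

if __name__ == "__main__":
    print(support({"bread", "milk"}, TRANSACTIONS))       # 2 of 4 transactions
    print(confidence({"bread"}, {"milk"}, TRANSACTIONS))  # 2 of the 3 with bread
```

On this data the rule bread ⇒ milk passes with minsupp = 0.5 and minconf = 0.6 but fails once minconf is raised to 0.7, which shows how strongly the outcome depends on thresholds the user must guess.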
Expanding ARMGA for generalized association rules
Algorithm ARMGA is designed for Boolean association rule mining. This section will expand it to deal with generalized association rules.
Computations
We have conducted a set of experiments to evaluate the designed algorithms. For reasons of space, this section reports on only three groups of them.
Summary
We have designed a genetic algorithm-based strategy and its corresponding ARMGA and EARMGA algorithms. Our approach delivers two benefits: (1) high-performance association rule mining; and (2) system automation. Computational results show that our model can be taken as an alternative for effective association rule mining.
The most important difference between our algorithm and existing mining strategies is that our approach does not require the minimum-support threshold. The experimental results of
Acknowledgements
This work was supported in part by an Australian large ARC grant (DP0667060), a China NSF major research Program (60496327), China NSF grants (90718020, 60625204), a China 973 Program (2008CB317108), an Overseas Outstanding Talent Research Program of Chinese Academy of Sciences (06S3011S01), an Overseas-Returning High-level Talent Research Program of China Human-Resource Ministry, the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities (07JJD720044), and
References (19)
- Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In...
- Au, W., & Chan, K. (2002). An evolutionary approach for discovering changing patterns in historical data. In...
- Blake, C., & Merz, C. (1998). UCI repository of machine learning databases...
- Cheung, Y., & Fu, A. (2004). Mining frequent itemsets without support threshold: With and without item constraints....
- Fidelis, M., Lopes, H., & Freitas, A. (2000). Discovering comprehensible classification rules with a genetic algorithm....
- A genetic algorithm for generalized rule induction
- A survey of evolutionary algorithms for data mining and knowledge discovery
- Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM...
- Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002). Mining top-k frequent closed patterns without minimum support. In...