Elsevier

Expert Systems with Applications

Volume 40, Issue 17, 1 December 2013, Pages 6823-6836
Mining frequent patterns and association rules using similarities

https://doi.org/10.1016/j.eswa.2013.06.041

Abstract

Most current algorithms for mining association rules assume that two object subdescriptions are similar only when they are exactly equal, but many real-world problems use other similarity functions. These algorithms are commonly divided into two steps: frequent pattern mining and generation of interesting association rules from the frequent patterns. In this work, two algorithms for mining frequent similar patterns using similarity functions other than equality are proposed. Additionally, the GenRules Algorithm is adapted to generate interesting association rules from frequent similar patterns. Experimental results show that our algorithms are more effective and obtain better quality patterns than the existing ones.

Introduction

Association Rule Mining (Agrawal, Imielinski, & Swami, 1993) is an important task for Knowledge Discovery in Data. It has been used for marketing (Wong et al., 2005; Yunyan & Juan, 2010), crime analysis (Leong, Chan, Ng, & Shiu, 2008), intrusion detection (Ertoz et al., 2004), fraud detection (Sánchez, Vila, Cerda, & Serrano, 2009), disease diagnosis or analysis (Chaves et al., 2012; Chaves et al., 2013; Dua et al., 2009; Nahar et al., 2013; Patil et al., 2010), etc. Association rule mining consists of finding interesting “if-then” rules between feature value combinations in a dataset. An association rule X ⇒ Y, where X and Y are combinations of feature values (patterns), means that if X appears in an object then Y also appears in the same object. Commonly, an association rule is considered interesting if its frequency and confidence are not less than user-specified frequency and confidence thresholds. The frequency of a rule X ⇒ Y is the frequency of the pattern XY in a dataset Ω, and its confidence is the fraction of the objects containing X that also contain Y. Association rule mining consists of two fundamental steps: (I) search for frequent patterns (patterns with frequency not less than a frequency threshold); (II) construction of association rules from the frequent patterns.
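
The frequency and confidence definitions above can be illustrated with a minimal Python sketch for the classical case, where exact equality is the similarity function. The function names (`contains`, `frequency`, `confidence`) and the toy dataset are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Dict, List

Obj = Dict[str, object]      # an object maps feature names to values
Pattern = Dict[str, object]  # a pattern is a combination of feature values


def contains(obj: Obj, pattern: Pattern) -> bool:
    """True when every feature value of the pattern appears in the object (exact equality)."""
    return all(obj.get(feature) == value for feature, value in pattern.items())


def frequency(dataset: List[Obj], pattern: Pattern) -> float:
    """Fraction of the objects in the dataset that contain the pattern."""
    return sum(contains(o, pattern) for o in dataset) / len(dataset)


def confidence(dataset: List[Obj], x: Pattern, y: Pattern) -> float:
    """Frequency of XY divided by the frequency of X (0.0 when X never appears)."""
    freq_x = frequency(dataset, x)
    return frequency(dataset, {**x, **y}) / freq_x if freq_x else 0.0


# Hypothetical toy data, not taken from the paper.
toy = [
    {"Married": "No", "Car": "compact"},
    {"Married": "No", "Car": "compact"},
    {"Married": "No", "Car": "big"},
    {"Married": "Yes", "Car": "big"},
]
x, y = {"Married": "No"}, {"Car": "compact"}
print(frequency(toy, {**x, **y}))  # frequency of the rule X => Y: 0.5
print(confidence(toy, x, y))       # 2 of the 3 objects containing X also contain Y
```

Under this sketch, the rule would be reported as interesting only if both printed values reach the user-specified thresholds.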

The first step (also called frequent pattern mining) is very important by itself because it discovers regularities (patterns) in data. Depending on the application, these patterns could represent user profiles, modus operandi, common syndromes, risk factors, etc., in areas such as Marketing, Bioinformatics, Medicine, Network security, and others (Alatas et al., 2008; Hu et al., 2008; Kalpana & Nadarajan, 2008; LaRosa et al., 2008; Lopez et al., 2008; Xin & Zhi-Hong, 2010; Li & Deng, 2010). Moreover, frequent patterns play an essential role in some methods for other data mining tasks such as classification (Hernández-León et al., 2012; Nahar et al., 2013; Nguyen et al., 2012) and clustering (Malik, Kender, Fradkin, & Moerchen, 2010).

What does a frequent pattern mean? It means that the same feature value combination occurs at least a certain number of times in the dataset. For example, given the dataset described by numerical and non-numerical features (mixed data) shown in Table 1, and assuming 0.6 as the frequency threshold and 0.9 as the confidence threshold, the only frequent combination of feature values is (Married = No), which appears 4 times in the 6 objects of the dataset; there are no interesting association rules.

The concept of similarity, or its opposite, the concept of dissimilarity (not necessarily a distance), is a natural tool commonly used in soft sciences to make decisions, for example in Geology (Gómez, Rodríguez, Valladares, & Ruiz-Shulcloper, 1994), Medicine (Ortiz-Posadas, Vega-Alvarado, & Toni, 2009), and Sociology (Ruiz-Shulcloper & Fuentes-Rodríguez, 1981). If a similarity function other than equality is employed, a frequent pattern, also called a frequent similar pattern (Rodríguez-González, Martínez-Trinidad, Carrasco-Ochoa, & Ruiz-Shulcloper, 2008), is a combination of feature values of the study objects such that the similarity accumulation of its similar patterns is not less than a user-specified frequency threshold.

Considering this definition of a frequent similar pattern, and supposing that (I) two ages are similar if the absolute value of their difference is at most 5 years, and (II) compact cars are similar to medium cars, medium cars are similar to big cars, and big cars are similar to fancy cars, then the frequent similar patterns and the interesting association rules mined from Table 1, as well as their frequency and confidence values, would be those shown in Table 2.
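
The two similarity criteria just described can be written down as a short Python sketch. Several parts are assumptions added for illustration and are not confirmed by the text: the car similarity is taken to be symmetric and reflexive, the feature Married is compared by equality, objects are considered similar on a feature set S only when all features in S are similar (a Boolean conjunction), and the similarity accumulation is normalized by the dataset size. The toy objects are hypothetical and are not the contents of Table 1.

```python
from typing import Callable, Dict, List, Sequence

Obj = Dict[str, object]

# Pairs of similar car types as listed in the text (symmetry is an assumption).
SIMILAR_CARS = {
    frozenset(("compact", "medium")),
    frozenset(("medium", "big")),
    frozenset(("big", "fancy")),
}


def similar_age(a: int, b: int) -> bool:
    """Two ages are similar if they differ by at most 5 years."""
    return abs(a - b) <= 5


def similar_car(a: str, b: str) -> bool:
    """Equal car types, or a pair listed in SIMILAR_CARS, are similar."""
    return a == b or frozenset((a, b)) in SIMILAR_CARS


def equal(a: object, b: object) -> bool:
    return a == b


# Hypothetical assignment of a comparison criterion to each feature.
COMPARATORS: Dict[str, Callable] = {"Age": similar_age, "Car": similar_car, "Married": equal}


def f_s(o1: Obj, o2: Obj, features: Sequence[str]) -> int:
    """Assumed Boolean similarity: 1 when the objects are similar on every feature in S."""
    return int(all(COMPARATORS[r](o1[r], o2[r]) for r in features))


def similar_frequency(dataset: List[Obj], obj: Obj, features: Sequence[str]) -> float:
    """Accumulated similarity of obj's subdescription over S, divided by the dataset size."""
    return sum(f_s(obj, o, features) for o in dataset) / len(dataset)


# Hypothetical toy data, not Table 1 of the paper.
toy = [
    {"Age": 30, "Car": "compact", "Married": "No"},
    {"Age": 33, "Car": "medium", "Married": "No"},
    {"Age": 36, "Car": "big", "Married": "Yes"},
    {"Age": 50, "Car": "fancy", "Married": "No"},
]
print(similar_frequency(toy, toy[1], ["Age", "Car"]))  # 0.75
```

With exact equality, the subdescription (Age = 33, Car = medium) would match only its own object in this toy data, so the similarity-based count exposes a pattern that equality-based counting would miss.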

As can be seen, using a similarity function other than equality (between feature values and object descriptions) produces frequent patterns and interesting association rules that are hidden from algorithms that use equality as the similarity function.

Preliminary results of this paper were presented in Rodríguez-González et al. (2008). In the present work, we focus on association rule mining using similarity functions on mixed data. This process is divided into two steps: (I) frequent similar pattern mining; (II) generation of interesting association rules from frequent similar patterns. For the first step we propose two algorithms: one for similarity functions that hold the f-downward closure property and another for similarity functions that do not hold this property. For the second step we propose an adaptation of the GenRules Algorithm (Agrawal & Srikant, 1994). The main differences between this paper and the conference paper are that here (I) we formalize and prove the properties on which our frequent similar pattern mining algorithms are based, (II) we evaluate the quality of the mined patterns, and (III) we propose an adaptation of the GenRules Algorithm (Agrawal & Srikant, 1994) for computing association rules from frequent similar patterns.

It is important to highlight that in Rodríguez-González, Martínez-Trinidad, Carrasco-Ochoa, and Ruiz-Shulcloper (2011) a frequent similar pattern mining algorithm (called RP-Miner) was proposed for Boolean similarity functions that do not meet the Downward Closure property. The experimental results reported in Rodríguez-González et al. (2011) show that RP-Miner is more efficient than the algorithm proposed in this paper for similarity functions that do not fulfill the Downward Closure property (STreeNDC-Miner) and more effective than the algorithm proposed in this paper for similarity functions that fulfill the Downward Closure property (STreeDC-Miner). However, Rodríguez-González et al. (2011) also show that, in problems where the similarity function is known to fulfill the Downward Closure property, STreeDC-Miner is faster than RP-Miner, whereas in problems where the similarity function is known not to fulfill this property, STreeNDC-Miner finds all the patterns while RP-Miner finds only a subset. Therefore, the algorithms proposed in this paper constitute an alternative for those cases where RP-Miner does not provide good results, and the results presented here complete the study of algorithms for mining frequent similar patterns on mixed data using Boolean similarity functions other than equality.

The outline of this paper is as follows. In Section 2 related works are reviewed. Section 3 provides basic concepts. Section 4 describes the proposed frequent similar pattern mining algorithms. In Section 5 we adapt the GenRules Algorithm for computing association rules from frequent similar patterns. Finally, Sections 6 and 7 present the experimental results and the conclusions, respectively.

Section snippets

Related work

ObjectMiner (Dánger, Ruiz-Shulcloper, & Berlanga, 2004) was the first algorithm that used similarity functions for mining frequent patterns. To allow pruning the search space of frequent similar patterns, this algorithm was designed for similarity functions that satisfy the following condition: if two objects are not similar with respect to a feature set S, then they are not similar with respect to any superset of S. ObjectMiner was inspired by the Apriori Algorithm (Agrawal & Srikant, 1994). It works following a

Basic concepts

Let Ω = {O1, O2, …, On} be a dataset. Each object O is described by a set of features R = {r1, r2, …, rm} and represented as a tuple (v1, v2, …, vm), where vi ∈ Di (Di is the domain of the feature ri) for 1 ≤ i ≤ m. A subdescription of an object O for a subset of features S ⊆ R, denoted as IS(O), is the description of O in terms of the features in S; O[r] denotes the value of O in the feature r ∈ R; and fS(O,O′) denotes the similarity between O and O′ using their subdescriptions IS(O) and IS(O′) respectively (

Frequent similar pattern mining

The downward closure property has been used in frequent itemset mining for pruning the search space (Agrawal & Srikant, 1994). This property ensures that all supersets of a non-frequent itemset are also non-frequent itemsets. An analogous downward closure property, for mining frequent similar patterns, can be expressed as follows: all superdescriptions of a non-f-frequent subdescription are also non-f-frequent subdescriptions. We call this the f-downward closure property.
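
To show how the f-downward closure property permits pruning, the following Python sketch runs a generic level-wise (Apriori-style) search over feature subsets. It is not the STreeDC-Miner algorithm proposed in the paper, whose data structure is not described in this excerpt; it only illustrates the pruning idea. The helper names, the similarity criterion (ages within 5 years are similar, other features are compared by equality, combined as a Boolean conjunction), and the toy data are assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence, Set, Tuple

Obj = Dict[str, object]
# A subdescription is identified by its feature set and the values on those features.
Sub = Tuple[FrozenSet[str], Tuple[object, ...]]


def subdescription(obj: Obj, features: FrozenSet[str]) -> Sub:
    return features, tuple(obj[r] for r in sorted(features))


def mine_levelwise(dataset: List[Obj], features: Sequence[str],
                   freq: Callable[[List[Obj], Obj, FrozenSet[str]], float],
                   min_freq: float) -> Set[Sub]:
    """Level-wise search; correct only if freq satisfies the f-downward closure property."""
    frequent: Set[Sub] = set()
    for k in range(1, len(features) + 1):
        found = False
        for combo in combinations(sorted(features), k):
            s = frozenset(combo)
            for obj in dataset:
                sub = subdescription(obj, s)
                if sub in frequent:
                    continue  # same values already reported for another object
                # Prune: every subdescription with one feature removed must be frequent.
                if k > 1 and any(subdescription(obj, s - {r}) not in frequent for r in s):
                    continue
                if freq(dataset, obj, s) >= min_freq:
                    frequent.add(sub)
                    found = True
        if not found:
            break  # no frequent subdescription at this level, so none at larger levels
    return frequent


def similar(feature: str, a, b) -> bool:
    # Hypothetical criterion: ages within 5 years are similar; other features use equality.
    return abs(a - b) <= 5 if feature == "Age" else a == b


def freq(dataset: List[Obj], obj: Obj, s: FrozenSet[str]) -> float:
    # Boolean conjunction over S; adding features can only lower this value.
    return sum(all(similar(r, obj[r], o[r]) for r in s) for o in dataset) / len(dataset)


# Hypothetical toy data, not taken from the paper.
toy = [
    {"Age": 30, "Married": "No"},
    {"Age": 33, "Married": "No"},
    {"Age": 36, "Married": "Yes"},
    {"Age": 50, "Married": "No"},
]
result = mine_levelwise(toy, ["Age", "Married"], freq, min_freq=0.5)
print(len(result))  # 6
```

In this toy run the subdescriptions (Age = 50) and (Married = Yes) are not frequent, so the two-feature subdescriptions of the objects carrying them are skipped without computing their frequency. When the similarity function does not satisfy the property, this shortcut may discard frequent similar patterns, which is why the paper treats that case with a separate algorithm.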

Given a dataset

Generating interesting association rules

Using similarity functions other than equality to compute the frequency of the subdescriptions makes it possible to find interesting association rules that remain hidden when equality is used as the similarity function. Additionally, when equality is used instead of a similarity function other than equality, false association rules could be generated. A false association rule is a rule that using the genuine similarity function to compute its frequency and its confidence, results non

Experimental results

In this section, we compare the STreeDC-Miner + FSP-GenRules (STDC + GR) and STreeNDC + FSP-GenRules (STNDC + GR) algorithms against the ObjectMiner + FSP-GenRules (ObjMiner + GR) algorithm (provided by its authors) (Dánger et al., 2004). We conducted three experiments. In the first experiment (Section 6.1), we evaluate the performance of the proposed algorithms using a similarity function that satisfies the f-downward closure property. The comparison of the algorithms is in terms of the time needed to mine

Conclusions

In this paper, we focused on the problem of mining frequent patterns and association rules using similarities. We introduced several properties and proved several propositions that allow pruning the search space of frequent similar patterns. Based on these properties and propositions, an efficient data structure to store all necessary information about object subdescriptions and their similarities was introduced. Also, a novel and efficient algorithm for mining frequent similar patterns for

Acknowledgements

This work is partly supported by the National Council of Science and Technology of Mexico under the project CB2008-106443 and Grant No. 32086.

References (35)

  • Ortiz-Posadas, M.R., et al. (2009). A mathematical function to evaluate surgical complexity of cleft lip and palate. Computer Methods and Programs in Biomedicine.
  • Sánchez, D., et al. (2009). Association rules applied to credit card fraud detection. Expert Systems with Applications.
  • Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Research Report RJ...
  • Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In...
  • Agrawal, R., et al. Fast algorithms for mining association rules.
  • Chen, G., et al. (2000). Fuzzy association rules and the extended mining algorithms. Information Sciences.
  • Choi, D.W., & Hyun, Y.J. (2010). Transitive association rule discovery by considering strategic importance. In 2010...

Cited by (24)

    • Closed frequent similar pattern mining: Reducing the number of frequent similar patterns without information loss

      2018, Expert Systems with Applications
      Citation Excerpt :

      However, real life objects, such as objects in sociology (Ruiz-Shulcloper & Fuentes-Rodríguez, 1981), geology (Gómez-Herrera, Rodríguez-Morn, Valladares-Amaro et al., 1994), medicine (Ortiz-Posadas, Vega-Alvarado, & Toni, 2009) or information retrieval (Baeza-Yates, Ribeiro-Neto et al., 1999), are rarely equal or they can be described by non boolean features. Thus, similarity functions different from the exact matching were proposed to compare object descriptions giving rise to a new approach named frequent similar pattern mining which can handle datasets containing non boolean features by using similarity functions (Danger, Ruiz-Shulcloper, & Llavori, 2004; Rodríguez-González, Martínez-Trinidad, Carrasco-Ochoa, & Ruiz-Shulcloper, 2008; 2011; 2013). This approach produces patterns which can not be found by those algorithms based on exact matching.

    • Summarizing scale-free networks based on virtual and real links

      2016, Physica A: Statistical Mechanics and its Applications
    • Pattern recognition in Latin America in the "big data" era

      2015, Pattern Recognition
      Citation Excerpt :

      A set of frequent subgraphs is obtained with an FSM algorithm and then used as input features to an SVM classifier. Other articles, related to mining frequent similar patterns, are presented in [126,125]. Other articles in mining frequent similar patterns were presented in [89,100].

    • Association rule mining with mostly associated sequential patterns

      2015, Expert Systems with Applications
      Citation Excerpt :

      Jin, Wang, Huang, and Hu (2014) employed causality between antecedent and consequent to discover interesting rules; they used causality as an objective measure. The frequent itemsets and useful rules are explored by similarity instead of attribute–value equivalence in Rodríguez-González, Martínez-Trinidad, and Carrasco-Ochoa (2013). They adapted the algorithm proposed in Agrawal and Srikant (1994) to generate interesting rules.
