An Improved Algorithm for Mining Correlation Item Pairs

Abstract: The Apriori algorithm is widely used in traditional association rule mining to search for high-frequency patterns; correlation rules are then obtained by testing the correlation of the item sets. However, this approach tends to ignore association rules with low support and high correlation. To address this problem, some scholars have proposed a positive-correlation mining algorithm based on the Phi correlation coefficient, which can mine item sets with low support but high correlation. Although that algorithm prunes the search space, its running-time improvement is not obvious on large data sets, and the correlation pairs it produces may be meaningless. This paper presents an improved mining algorithm for correlation pairs based on a new interestingness measure, which uses an upper bound on the interestingness of supersets to prune the search space. It greatly reduces the running time and filters out meaningless correlation pairs according to redundancy constraints. Compared with the algorithm based on the Phi correlation coefficient, the new algorithm significantly reduces the running time and prunes redundant correlation pairs, thereby improving mining efficiency and accuracy.


Introduction
In practice, an association rule (A ⇒ B) exhibits one of the following four combinations of support and correlation: Low-Low, High-High, Low-High, and High-Low. Apriori-based association rule mining algorithms tend to ignore Low-High rules, which nevertheless have great practical research value. Rare purchase records are often more interesting than frequent ones, as in the purchase analysis of luxury goods in shopping malls. Therefore, mining Low-High rules is sometimes more valuable than mining high-support rules. Traditional association rule mining algorithms such as Apriori usually first find frequent item sets [Xu and Dong (2013); Rameshkumar, Sambath and Ravi (2013); Poundekar, Manekar, Baghel et al. (2014); Yuan, Li and Chen (2016); Tandon, Haque and Mande (2016)] and then generate association rules from those item sets. The time and space complexity of the first step is much higher than that of the second step. Tang et al. [Tang, Xu and Duan (2018)] reduce the space complexity. Said et al. [Said, Guillet, Richard et al. (2013)] propose new association rules based on the correlations between item pairs. Feng et al. [Feng, Zhu and Zhang (2016)] propose MH-Apriori, which optimizes the Apriori algorithm and improves its efficiency in mining frequent item sets. Pandagale et al. [Pandagale and Surve (2016)] use the Apriori MapReduce algorithm to find association rules with better space and time complexity. Xue et al. [Xue, Song, Qin et al. (2015)] propose a mutual-information-based quantitative association rule mining algorithm (MIQarma) to address the challenges of traditional spatiotemporal analysis approaches. Poundekar et al. [Poundekar, Manekar, Baghel et al. (2014)] propose an association rule mining method that reduces the database scanning time. The classic model of association rule mining is based on support and confidence metrics; Thangarasu et al. [Thangarasu and Sasikala (2014)] use tree-based association rules. Liu et al. [Liu and Wang (2013)] propose an association rule mining algorithm based on formal concept analysis that improves algorithmic efficiency. Tempaiboolkul [Tempaiboolkul (2013)] proposes an algorithm for discovering rare association rules in a distributed environment that achieves an optimized function. Jiang et al. [Jiang, Luan and Dong (2012)] propose WNAIIMS, a multi-support-based weighted negative association rule mining algorithm over invariant item sets. Quan et al. [Quan, Liu, Chen et al. (2012)] propose a new matrix-based frequent item set mining algorithm, and the experimental results show improved efficiency. In Qian et al. [Qian, Jia and Zhang (2008); Luo and Li (2014)], improved Apriori algorithms are proposed to increase the efficiency of traditional algorithms; the matrix method scans the database only once, which optimizes the operation and improves mining efficiency. Although it is relatively simple to extract association rules from frequent item sets, this easily produces meaningless and misleading rules. Ravi et al. [Ravi and Khare (2014)] propose an Efficient and Optimized Association Rule Mining algorithm (EO-ARM) that increases efficiency by scanning the data set only once. Yang et al. [Yang, Huang and Jin (2017)] present an improved algorithm that reduces the time spent scanning the transaction database while preserving the completeness of mining, which reduces the running time and improves mining efficiency. Davale et al.
[Davale and Shende (2015)] use logic to generate association rules, so there is no need to decide on a threshold value. In Chen et al. [Chen and Gao (2011)], based on the association rules generated from frequent item sets, correlation metrics are used to test the rules and avoid misleading rules. However, the correlation metric introduced in that paper is asymmetrically distributed around its threshold value of 1, and its value does not reflect the correlation strength of the association rules. Su et al. [Su and Guo (2014)] propose an interestingness model based on the cosine metric, which compensates for the asymmetry of the probability model around the value 1. Lu [Lu (2012)] avoids weak and misleading association rules. All these algorithms use the Apriori method to mine frequent item sets with high frequencies and ignore the infrequent parts, which often contain key information. Juan et al. [Juan, Li and Feng (2015)] propose research on deleting redundant association rules, which yields frequent association rules. Xiong et al. [Xiong, Shekhar and Tan (2004)] and Qian et al. [Qian, Feng and Wang (2005)] propose non-Apriori algorithms. They use the upper bound of the Phi coefficient to reduce the search space and mine all positively correlated pairs and all negatively correlated pairs. Compared with traditional Apriori-type methods, these algorithms mine not only High-High item pairs but also Low-High item pairs, and they also improve time performance. However, their running-time improvement is not obvious on big data sets, and the generated item pairs may be redundant and uninteresting to users. Yue et al. [Yue, Wang and Wang (2014)] reduce the running time and delete some redundant rules when mining association rules. In this paper, an algorithm is proposed to mine non-redundant positively correlated pairs and to prune the search space by the upper bound of the interestingness of the supersets of items or item pairs; the algorithm improves the time performance significantly compared with the one based on the Phi correlation coefficient. At the same time, item pairs that are uninteresting and redundant for the user are pruned. The arrangement of this paper is as follows. Section 2 introduces the conceptual knowledge related to the new algorithm, including support, rules, strong rules, positive association rules, pairs of positive correlations, and non-redundant item pairs. Section 3 gives the knowledge of the interestingness measure, such as the definition of interestingness, the superset interestingness upper bound, and the relationship between the interestingness and correlation measures. Section 4 gives the main ideas and implementation of the new algorithm. Section 5 gives the experimental simulation results and related performance analysis. Finally, Section 6 summarizes the paper and briefly describes the follow-up work.

Related concepts

Definition 2.1: Support: Given a transaction database D with n transactions over an item set I, each transaction T in the database satisfies T ⊆ I. Now assume that there is an item set A, A ⊆ I. Its support is the percentage of transactions T in data set D that contain the item set A, that is, sup(A) = |cov(A)| / n, where cov(A) denotes the set of transactions in D that contain A.

Definition 2.2: Rules: Given a transaction database D and an item set I, an expression of the form (A ⇒ B), where A and B are item sets over I, is an association rule.
The support of the association rule (A ⇒ B) is the proportion of records in data set D that contain both the item set A and the item set B, that is, sup(A ⇒ B) = sup(A ∪ B). The confidence of the rule (A ⇒ B) is the ratio of the number of transactions containing both item sets A and B to the number of transactions containing A:

conf(A ⇒ B) = sup(A ∪ B) / sup(A)    (2)

Definition 2.3: Strong rules: Set the minimum support threshold t1 and the minimum confidence threshold t2. When the rule (A ⇒ B) meets both minimum thresholds, the association rule (A ⇒ B) is a strong rule.
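To make Definitions 2.1-2.3 concrete, the following is a minimal sketch (not taken from the paper); the transaction list and the thresholds are hypothetical and serve only to illustrate how support and confidence are computed.

```python
# Minimal sketch (illustrative only): support and confidence of a rule A => B
# over a hypothetical transaction database D. All data below is made up.

def support(itemset, transactions):
    """sup(itemset): fraction of transactions that contain all items of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(A, B, transactions):
    """conf(A => B) = sup(A union B) / sup(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)

# Hypothetical data set D (each transaction T is a subset of the item set I).
D = [
    {"tea", "coffee"}, {"coffee"}, {"tea"}, {"coffee", "milk"}, {"coffee"},
]

t1, t2 = 0.1, 0.6  # example minimum support and minimum confidence thresholds
rule_support = support({"tea", "coffee"}, D)        # 0.2
rule_confidence = confidence({"tea"}, {"coffee"}, D)  # 0.5
print(rule_support, rule_confidence)
# (tea => coffee) is a "strong rule" iff rule_support >= t1 and rule_confidence >= t2.
```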
Definition 2.4: Positive association rules: Example 1: Consider the relationship between people who like coffee and people who like tea. Suppose A denotes the people who buy tea and B denotes the people who buy coffee. Assume that the minimum support of the rule is 0.1 and the minimum confidence is 0.6. The following conclusions can be drawn by analyzing Tab. 1.
The rule (buy_tea ⇒ buy_coffee) satisfies both thresholds, so it is a strong rule. On the other hand, the rule (do_not_buy_tea ⇒ buy_coffee) has greater confidence and accuracy, more than 80%; that is, a customer who does not buy tea is even more likely to buy coffee. Therefore, the strong rule mined by the traditional algorithm is misleading in this case and is in fact a negative association rule, and the traditional algorithm cannot mine (do_not_buy_tea ⇒ buy_coffee). To avoid this deficiency of traditional algorithms, a correlation metric can be used. Definition 2.6: Item pairs without redundancy: A pair of items that satisfies a positive correlation is not necessarily a meaningful pair of items; what really matters are the pairs that the user is interested in. If the result of the mining is already expected by the user, the item pair is meaningless to the user. Therefore, the following constraints must be satisfied for a pair to be a non-redundant pair of interest to the user. Suppose an item set x has an item i, and if item i contains
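As an illustration of why a correlation check is needed, the sketch below computes confidence and a lift-style correlation for the tea/coffee example. The counts are hypothetical (they are not the paper's Tab. 1); they are chosen only so that (buy_tea ⇒ buy_coffee) passes the support and confidence thresholds while tea and coffee are negatively correlated.

```python
# Hypothetical counts (NOT the paper's Tab. 1), chosen only to illustrate a
# misleading strong rule: the rule passes the support/confidence thresholds
# while tea and coffee are actually negatively correlated.
n = 100            # total transactions
tea_coffee = 20    # buy tea and coffee
tea_only = 5       # buy tea, no coffee
coffee_only = 70   # buy coffee, no tea

sup_rule = tea_coffee / n                                 # 0.20 >= 0.1 (min support)
conf_rule = tea_coffee / (tea_coffee + tea_only)          # 0.80 >= 0.6 (min confidence)
conf_no_tea = coffee_only / (n - tea_coffee - tea_only)   # ~0.93: even higher

p_tea = (tea_coffee + tea_only) / n
p_coffee = (tea_coffee + coffee_only) / n
lift = sup_rule / (p_tea * p_coffee)                      # ~0.89 < 1: negative correlation
print(sup_rule, conf_rule, conf_no_tea, lift)
```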

Superset interestingness upper bound
where z = x \ y.
The theorem about the correlation measure M(sup(x), sup(y), sup(z)) is as follows: Theorem 1: When sup(y), sup(z), and the number n of transactions in the data set are constant, the correlation measure M(sup(x), sup(y), sup(z)) is proportional to sup(x).
From the above theorem on the correlation measure, an upper bound of the correlation measure can be obtained. For an item set x and any superset x′, x ⊆ x′, the upper bound of the correlation measure of the superset x′ of the item set x is M(sup(x), sup(y), max_{i∈x}(sup({i}))). The items are sorted in ascending order of support, so for any item {j} that extends x, sup({j}) ≥ max_{i∈x}(sup({i})), where i ∈ x. Therefore, when the third parameter of the correlation measure M takes its smallest possible value, namely max_{i∈x}(sup({i})), M is the largest, and this maximum value is the upper bound of the correlation measure of the item set's supersets.
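The following is a minimal sketch of this upper-bound computation under the assumptions stated above: the measure M is treated as an opaque function that decreases in its third argument, items are pre-sorted in ascending order of support, and all function and variable names here are placeholders rather than the paper's implementation.

```python
# Sketch (assumptions labeled in comments): upper bound of the correlation
# measure over all supersets x' of an item set x. `M` is an opaque correlation
# measure M(sup_x, sup_y, sup_z) assumed to DECREASE in its third argument;
# `item_support` maps a single item to its support. Both are placeholders.

def superset_upper_bound(M, sup_x, sup_y, x, item_support):
    # Items are assumed sorted in ascending order of support, so any item j
    # that extends x satisfies sup({j}) >= max(sup({i}) for i in x).
    # Plugging the smallest admissible third argument into M therefore yields
    # an upper bound for M over all supersets x' of x.
    smallest_third_arg = max(item_support[i] for i in x)
    return M(sup_x, sup_y, smallest_third_arg)

if __name__ == "__main__":
    toy_M = lambda sx, sy, sz: sx * sy / sz   # made-up measure, decreasing in sz
    print(superset_upper_bound(toy_M, 0.2, 0.3, ["a", "b"], {"a": 0.2, "b": 0.4}))
```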

Measurement of interestingness and relevance
Interestingness is defined as in Eq. (9). It is obvious from Eq. (9) that the interestingness and the correlation are directly proportional to each other. Assume that an item set x can be divided into two parts y and z; then the upper bound of the interestingness measure of the superset x′ of the item set x is obtained by evaluating the interestingness when the correlation is equal to its maximum value.

Usage of the interestingness upper bound of supersets of items and item pairs
The upper bound of interestingness can be used to prune the search space in the algorithm.
When the upper bound of the interestingness of the supersets of an item {i} or of an item set x, denoted x′, is smaller than the threshold t, the search space of the item {i} or the item set x can be pruned, and the complexity of the algorithm is reduced.
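As a minimal sketch of how this pruning could be applied (the function `interestingness_upper_bound` is a hypothetical placeholder, not the paper's code):

```python
# Sketch of upper-bound pruning (hypothetical helper name): items or item sets
# whose superset interestingness upper bound falls below the threshold t are
# dropped, so none of their extensions are ever generated.

def prune_by_upper_bound(candidates, t, interestingness_upper_bound):
    # `candidates` are single items {i} or item sets x; the helper returns the
    # interestingness upper bound over all supersets of the candidate.
    return [c for c in candidates if interestingness_upper_bound(c) >= t]
```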

Item pairs mining algorithm based on interestingness

Main idea of the algorithm
Based on the redundancy constraints and the upper bound of interestingness, the algorithm traverses the item pair pattern search space and finds the non-redundant positively correlated item pairs. First, calculate the upper bound of the interestingness of the supersets of each item I_n, arrange the items from largest to smallest according to this upper bound value, and prune the items whose superset interestingness upper bound is less than the threshold t, thereby reducing the search space. Then the item set combination extension is performed, and item I_1 and item I_2 are combined into an item pair. First check whether the pair is redundant; if it is redundant, move on to the next pair. Otherwise, calculate the upper bound of the interestingness of the supersets of the pair and check whether it is greater than the threshold t; if it is not greater, find the next pair. If it is satisfied, continue to calculate whether the interestingness of the two items of the pair exceeds the threshold, and output the pair as a non-redundant positively correlated pair if it does.
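The following is a minimal sketch of the main idea described above, under assumptions: `superset_upper_bound`, `is_redundant`, and `interestingness` are hypothetical placeholder functions (the paper does not give their code here), and items are represented abstractly.

```python
# Sketch of the main idea (hypothetical helpers; not the paper's implementation).
# superset_upper_bound(s): upper bound of interestingness over all supersets of s
# is_redundant(pair):      redundancy check in the sense of Definition 2.6
# interestingness(pair):   interestingness of the pair itself

def mine_pairs(items, t, superset_upper_bound, is_redundant, interestingness):
    # Step 1: prune items whose superset upper bound is below t,
    # then sort the survivors by that upper bound, largest first.
    kept = [i for i in items if superset_upper_bound((i,)) >= t]
    kept.sort(key=lambda i: superset_upper_bound((i,)), reverse=True)

    results = []
    # Step 2: extend items into pairs and filter them.
    for a in range(len(kept)):
        for b in range(a + 1, len(kept)):
            pair = (kept[a], kept[b])
            if is_redundant(pair):
                continue  # skip redundant pairs
            if superset_upper_bound(pair) < t:
                continue  # no superset of this pair can reach the threshold
            if interestingness(pair) >= t:
                results.append(pair)  # non-redundant positively correlated pair
    return results
```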

Time performance analysis
In order to analyze the performance of the algorithm, it is implemented in MATLAB and run on real data sets. The experimental environment is: Intel(R) Core(TM) 2 Quad CPU, 2.00 GB of memory, and Windows 7. The running time of the algorithm is compared with that of the algorithm of Xiong et al. [Xiong, Shekhar and Tan (2004)] on the experimental data sets. The experimental data sets include T10I4D100K, T40I10D100K, and Kosarak, which are collected from the UCI website and preprocessed. The characteristics of the data sets are shown in Tab. 2. The running time of the algorithm is compared with that of Xiong et al. [Xiong, Shekhar and Tan (2004)] on the five data sets T10I4D100K, Pumsb, Accidents, Kosarak, and T40I10D100K. As shown in Figs. 1-5, the running time of both the new algorithm and the algorithm of Xiong et al. [Xiong, Shekhar and Tan (2004)] decreases as the minimum interestingness threshold increases. However, the running time of the new algorithm is greatly reduced relative to the previous algorithm, and the time performance is significantly improved.

Pruning rate
Assume that n is the number of items and Pairs is the number of item pairs remaining after pruning in the algorithm; the pruning rate of the algorithm can then be expressed as the proportion of candidate item pairs eliminated by pruning. As shown in Fig. 6, the pruning rate increases as the minimum interestingness threshold increases in the different experiments.
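A plausible form of this rate, stated here only as an assumption since the paper's formula is not reproduced above, is the fraction of the n(n-1)/2 candidate pairs that are removed:

```python
# Assumed form of the pruning rate (not reproduced from the paper): the total
# number of candidate item pairs is n*(n-1)/2, and `pairs` is the number of
# item pairs that remain after pruning.

def pruning_rate(n, pairs):
    total_pairs = n * (n - 1) / 2
    return 1 - pairs / total_pairs

# Example: with 1000 items and 20000 surviving pairs, about 96% of the
# candidate pairs are pruned.
print(pruning_rate(1000, 20000))
```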

Number of correlated item pairs
The algorithm proposed in this paper can not only prune the search space but also greatly shorten the running time compared with the Phi correlation coefficient algorithm of Xiong et al. [Xiong, Shekhar and Tan (2004)], and the pruning efficiency increases as the interestingness threshold increases. At the same time, the number of correlated item pairs decreases as the interestingness threshold increases. Compared with the algorithm proposed in Xiong et al. [Xiong, Shekhar and Tan (2004)], the new algorithm retains most of the correlated item pairs while removing the redundant, meaningless ones, which improves the efficiency and accuracy of the mining method. As shown in Figs. 7 and 8, the results of the two algorithms differ on the Kosarak and T40I10D100K data sets. The result item pairs of our algorithm are fewer than those of Xiong et al. [Xiong, Shekhar and Tan (2004)] under the same interestingness threshold constraint, because our algorithm filters out redundant item pairs and obtains meaningful, positively correlated item pair results.

Verification analysis
In the above experiments, the number of item pairs mined by the new algorithm is smaller than that of the Phi algorithm, but it is not immediately certain that the pruned pairs are meaningless. The correctness of the algorithm is therefore verified by comparing the results on the real data sets Kosarak and T40I10D100K, respectively. When the threshold is 0.8 or 0.9, the new algorithm and the Phi algorithm mine the same result item pairs. After verifying and analyzing the experimental results in detail, it is shown that the new algorithm can prune the redundant item pairs more effectively than the Phi algorithm. Therefore, the new algorithm can not only greatly shorten the running time, but also filter out the meaningless item pairs and improve the efficiency.

Conclusions
The Phi correlation coefficient mining algorithm has several shortcomings: its time performance is not efficient enough, and the item pairs it mines may be redundant and uninteresting to users. A new interestingness model is proposed in this paper, which uses the interestingness upper bound of supersets to prune the search space. Compared with the Phi correlation coefficient algorithm, the time performance is improved, and meaningless item pairs are filtered out according to the redundancy constraints. Experimental verification on real data sets shows that the mining efficiency and accuracy are indeed improved, i.e., the algorithm is feasible. Follow-up work will extend the algorithm to the mining of entire frequent item sets.