An Evolutionary Algorithm to Mine High-Utility Itemsets

High-utility itemset mining (HUIM) has attracted considerable attention in recent years because, unlike frequent itemset mining (FIM) for association rules (ARs), it reveals profitable products by considering both the quantity and the profit of items. In this paper, an evolutionary algorithm based on binary particle swarm optimization is presented to efficiently mine high-utility itemsets (HUIs). A maximal pattern (MP)-tree structure is further designed to solve the combinational problem in the evolution process. Substantial experiments on real-life datasets show that the proposed binary PSO-based algorithm outperforms the state-of-the-art GA-based algorithm.


Introduction
Knowledge discovery in databases (KDD) is an emerging issue since potential or implicit information can be found in very large databases. Among KDD tasks, frequent itemset mining (FIM) and association-rule mining (ARM) have been extensively developed to mine the itemsets whose occurrence frequency is no less than a minimum support threshold, or the rules whose confidence is no less than a minimum confidence threshold [2], [5]. Since FIM and ARM consider only the occurrence frequencies of itemsets, they are insufficient for identifying high-profit itemsets, especially when such itemsets rarely appear but have high profit values. To overcome this limitation, high-utility itemset mining (HUIM) [20], [21], [22] was designed to discover the "useful" and "profitable" itemsets in quantitative databases. Chan et al. [4] first addressed the utility mining problem instead of FIM. Yao et al. [20] considered the quantity of items as the internal utility and the unit profit of items as the external utility to discover HUIs. Liu et al. designed the two-phase (TWU) model and developed the transaction-weighted downward closure (TWDC) property for mining HUIs [16]. Lin et al. developed the condensed high-utility pattern (HUP)-tree and a related algorithm for discovering HUIs [14]. Lan et al. designed a mining algorithm based on an index-projection mechanism and developed a pruning strategy to efficiently mine HUIs [10]. An efficient list-based algorithm (HUI-Miner) was also proposed to mine HUIs without candidate generation [15]. Fournier-Viger et al. extended the HUI-Miner algorithm and presented the FHM algorithm, which mines HUIs using a structure built from 2-itemsets [7] and is the state-of-the-art algorithm in traditional HUIM.
The traditional HUIM algorithms have to handle the "exponential problem" of a very large search space when the number of distinct items or the size of the database is very large. Evolutionary computation is an efficient way to find optimal solutions using the principles of natural evolution [3]. The genetic algorithm (GA) [11] is an optimization approach for NP-hard and non-linear problems; it explores a very large search space to find optimal solutions based on a designed fitness function and operators such as selection, crossover, and mutation. Kannimuthu and Premalatha adopted the genetic algorithm and developed the high utility pattern extracting approach using genetic algorithm with ranked mutation using minimum utility threshold (HUPE_umu-GRAM) to mine HUIs [9]. Like GAs, particle swarm optimization (PSO) [12] is a bio-inspired, population-based approach for finding optimal solutions, but it updates the particles by means of velocities. In this paper, a binary PSO-based (BPSO) [13] algorithm with an improved maximal pattern (MP)-tree structure is designed for mining HUIs. The major contributions of this paper are described below:
• We present a discrete PSO-based algorithm to mine high-utility itemsets (HUIs) based on the sigmoid updating strategy. The TWU model is also used in the designed algorithm to reduce the size of each particle, thus speeding up the combinational phase for revealing the HUIs in the evolution process.
• An efficient maximal pattern (MP)-tree structure is developed to reduce the number of database scans by pruning the invalid combinations of the particles early.
• Extensive experiments show that the proposed approach outperforms the state-of-the-art GA-based algorithm [9] in terms of both runtime and the number of discovered HUIs.

Particle Swarm Optimization
Many heuristic algorithms have been used in evolutionary computation to solve optimization problems and discover the required information [3]. The simple genetic algorithm (sGA) [11], which was inspired by Darwinian principles, is a fundamental search technique for finding feasible and optimal solutions in a limited amount of time. Many variants of GAs have been extensively studied and applied to a wide range of optimization problems [9], [19].
Kennedy and Eberhart first introduced particle swarm optimization (PSO) [12] in 1995, which was inspired by the flocking behavior of birds, to solve optimization problems. Unlike in a GA, each particle keeps a memory of its own best previous position (personal best, pbest) and of the best position found in its neighborhood (global best, gbest).
PSO was originally defined for continuous-valued spaces. In real-world situations, however, many problems, such as scheduling and routing, are defined over discrete variable spaces. Kennedy and Eberhart therefore also designed a discrete (binary) PSO (BPSO) [13] to overcome this limitation of continuous PSO. Sarath and Ravi designed a BPSO optimization approach to discover ARs [18]. Other applications that adopt PSO to mine the required information are still in progress [8], [17].
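The binary update can be illustrated with a minimal Python sketch of one BPSO step, under stated assumptions. The velocity rule follows the usual PSO form with an inertia weight w and acceleration constants c1 and c2, and each bit is then resampled with probability sigmoid(velocity). The function name update_particle, the velocity clamp v_max, and the default parameter values are illustrative assumptions, not details taken from [12] or [13].

```python
import math
import random


def sigmoid(v):
    """Map a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(position, velocity, pbest, gbest,
                    w=0.9, c1=2.0, c2=2.0, v_max=4.0, rng=random):
    """One binary PSO step for a single particle (illustrative sketch).

    position, pbest and gbest are lists of 0/1 bits; velocity is a list
    of floats. v_max is an assumed clamp that keeps sigmoid() away from
    exactly 0 or 1.
    """
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # Velocity is pulled toward the personal and global best bits.
        v_new = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
        v_new = max(-v_max, min(v_max, v_new))
        new_velocity.append(v_new)
        # The bit is set to 1 with probability sigmoid(v_new).
        new_position.append(1 if rng.random() < sigmoid(v_new) else 0)
    return new_position, new_velocity
```

A bit whose velocity is strongly positive is very likely to become 1, and a strongly negative velocity makes 0 very likely, so the velocity acts as a per-bit tendency instead of a displacement.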

High-Utility Itemset Mining
High-utility itemset mining (HUIM) has been an emerging topic in recent decades. Chan et al. first developed a mining framework for discovering the top-k closed utility patterns [4]. Yao et al. then designed an approach that reveals HUIs by considering both the purchase quantity (the internal utility) and the profit (the external utility) of items [20], [21]. Liu et al. developed a two-phase (TWU) model and designed the transaction-weighted downward closure (TWDC) property to prune unpromising itemsets early when discovering HUIs [16]. Erwin et al. presented a parallel projection scheme that uses disk storage to handle large-scale databases; the designed approach performs well on both dense and sparse databases [6]. Lan et al. presented a projection-based approach that reduces the size of the processed databases level by level for mining HUIs [10]. Fournier-Viger et al. [7] proposed the FHM algorithm, which keeps the relationships of 2-itemsets to reduce the computations of the HUI-Miner algorithm [15] and is the state-of-the-art algorithm in traditional HUIM.
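The notions of utility and transaction-weighted utilization (TWU) used above can be made concrete with a small Python sketch. The toy database, the profit table, and all function names are hypothetical and serve only as an illustration. The utility of an itemset sums quantity times unit profit over the transactions that contain it, while its TWU sums the full utility of those transactions. TWU therefore never falls below the utility of the itemset or of any of its supersets, which is what makes TWDC-style pruning safe.

```python
# Hypothetical toy database: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "d": 4},
]
# Hypothetical unit profits (external utilities).
profit = {"a": 5, "b": 2, "c": 1, "d": 3}


def transaction_utility(t):
    """Total utility of one transaction: sum of quantity * unit profit."""
    return sum(q * profit[i] for i, q in t.items())


def utility(itemset, db):
    """Utility of an itemset, summed over the transactions containing it."""
    return sum(sum(t[i] * profit[i] for i in itemset)
               for t in db if all(i in t for i in itemset))


def twu(itemset, db):
    """Transaction-weighted utilization: an upper bound on the utility."""
    return sum(transaction_utility(t)
               for t in db if all(i in t for i in itemset))


def high_twu_1_itemsets(db, min_util):
    """Items whose TWU reaches the minimum utility count (1-HTWUIs)."""
    items = sorted({i for t in db for i in t})
    return [i for i in items if twu([i], db) >= min_util]
```

With a minimum utility count of 18.5 (half of the total utility 37 of this toy database), item d has a TWU of 14 and is pruned, so no itemset containing d needs to be examined.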
Instead of traditional HUIM, Kannimuthu and Premalatha first designed a GA-based algorithm with ranked mutation to mine HUIs [9], which is the state-of-the-art algorithm for mining HUIs in the evolution process. In their approach, it is not easy to find the initial 1-HTWUIs used as chromosomes for finding HUIs, and a very large amount of computation is necessary to set appropriate initial chromosomes. Moreover, some specific parameters are required in the evolution process of GAs. In this paper, a binary PSO-based algorithm with an efficient maximal pattern (MP)-tree structure is designed to avoid invalid combinations, thus improving the efficiency of discovering HUIs.

Proposed Algorithm
The designed binary PSO (BPSO)-based model [13] consists of four processes for mining HUIs: pre-processing, particle encoding, fitness evaluation, and updating.
In the pre-processing process, the high-transaction-weighted utilization 1-itemsets (1-HTWUIs) are first discovered based on the TWU model [16]. The valid combinations of the itemsets in the database are also compressed into a maximal pattern (MP)-tree structure, which is used to avoid generating invalid particles in the updating process. It uses the OR and NOR operators to determine whether the combined itemsets form a valid combination in the database. A simple MP-tree structure is shown in Fig. 1. In the particle encoding process, the items of the 1-HTWUIs are sorted in alphabetic-ascending order, and the j-th item corresponds to the j-th position of a particle. The particle is thus encoded as a set of binary variables corresponding to the sorted order of the 1-HTWUIs. In the fitness evaluation process, the particles are evaluated to find the pbest and gbest particles in the evolution process. In the updating process, the particles are updated by the velocities, pbest, gbest, and the sigmoid function. The MP-tree structure is used here to generate only the valid combinations of the particles. If the fitness value of a particle is no less than the minimum utility count (the minimum utility threshold multiplied by the total utility), it is considered a HUI and put into the set of HUIs. In the designed algorithm, the fitness function is the same as in traditional HUIM: fitness(p_i) = u(X), in which X is the itemset formed by the union of the items at the j-th positions of the particle p_i that are set to 1, and u(X) is the utility of X in the database. This iteration is repeated until the termination criterion is met. After that, the set of HUIs is discovered. Details of the proposed algorithm are described below.
In the designed algorithm, the set of 1-HTWUIs is first discovered based on the two-phase model [16] (Line 1). The number of 1-HTWUIs is used as the particle size, which reduces the combinational problem of traditional HUIM [20], [21] (Line 2). Each position of a particle is initialized to either 0 or 1 based on the binary PSO approach [13] (Line 3). The velocities of the particles are randomized in the range (0, 1) (Line 4). After that, pbest and gbest are initialized in the same way (Line 5). The velocities of the particles are then updated according to the traditional PSO approach [12]. The particles are updated based on the sigmoid function used in [13] and the designed MP-tree structure (Line 8). In this procedure, invalid particles (those not existing in the original database) cannot be generated, which greatly reduces the computations of multiple database scans. The fitness values of the updated particles are then determined to find the actual HUIs (Lines 10 to 11). After that, the pbest and gbest of the current particles are updated for the next iteration (Lines 12 to 13). This procedure is repeated until the termination criterion is met (Lines 6 to 14). Finally, the set of discovered HUIs is output (Line 15).
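The overall flow of the four processes can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. In particular, the MP-tree is replaced by a plain stand-in check (an itemset counts as valid if some transaction contains it), an invalid candidate is simply discarded, and the names mine_huis_bpso, v_max, toy_db, and toy_profit are hypothetical.

```python
import math
import random


def mine_huis_bpso(db, profit, min_ratio, pop_size=20, iterations=200,
                   w=0.9, c1=2.0, c2=2.0, v_max=4.0, seed=0):
    """Return {itemset: utility} for HUIs met during the search (may be incomplete)."""
    rng = random.Random(seed)
    tu = [sum(q * profit[i] for i, q in t.items()) for t in db]
    min_util = min_ratio * sum(tu)  # minimum utility count

    # Pre-processing: 1-HTWUIs are the items whose TWU reaches min_util.
    twu = {}
    for t, u in zip(db, tu):
        for i in t:
            twu[i] = twu.get(i, 0) + u
    items = sorted(i for i, v in twu.items() if v >= min_util)
    n = len(items)
    if n == 0:
        return {}
    # Stand-in for the MP-tree: transactions projected onto the 1-HTWUIs.
    projected = [frozenset(i for i in t if i in items) for t in db]

    def decode(bits):
        return frozenset(i for i, b in zip(items, bits) if b)

    def is_valid(bits):
        x = decode(bits)
        return bool(x) and any(x <= p for p in projected)

    def fitness(bits):
        # Utility of the decoded itemset over the transactions containing it.
        x = decode(bits)
        if not x:
            return 0
        return sum(sum(t[i] * profit[i] for i in x)
                   for t in db if all(i in t for i in x))

    huis = {}

    def record(bits, f):
        if any(bits) and f >= min_util:
            huis[decode(bits)] = f

    # Encoding and initialization: random bits, velocities in (0, 1).
    pos = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    vel = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    for p, f in zip(pos, pbest_fit):
        record(p, f)
    best = max(range(pop_size), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(iterations):
        for k in range(pop_size):
            cand = []
            for j in range(n):
                v = (w * vel[k][j]
                     + c1 * rng.random() * (pbest[k][j] - pos[k][j])
                     + c2 * rng.random() * (gbest[j] - pos[k][j]))
                vel[k][j] = max(-v_max, min(v_max, v))
                prob = 1.0 / (1.0 + math.exp(-vel[k][j]))  # sigmoid
                cand.append(1 if rng.random() < prob else 0)
            if not is_valid(cand):
                continue  # assumption: an invalid candidate is discarded
            pos[k] = cand
            f = fitness(cand)
            record(cand, f)
            if f > pbest_fit[k]:
                pbest[k], pbest_fit[k] = cand[:], f
            if f > gbest_fit:
                gbest, gbest_fit = cand[:], f
    return huis


# Hypothetical toy data: item -> quantity per transaction, and unit profits.
toy_db = [{"a": 1, "b": 2, "c": 1}, {"a": 2, "c": 3}, {"b": 1, "d": 4}]
toy_profit = {"a": 5, "b": 2, "c": 1, "d": 3}
result = mine_huis_bpso(toy_db, toy_profit, min_ratio=0.5)
```

Because the search is stochastic, the returned set is sound (every reported itemset reaches the minimum utility count) but not guaranteed to be complete. The stand-in validity check scans the projected transactions on every call, which is exactly the cost the MP-tree is designed to avoid.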

Experimental Results
Substantial experiments were conducted to verify the effectiveness and efficiency of the proposed algorithm compared to the state-of-the-art evolutionary HUPE_umu-GRAM algorithm [9]. The algorithms were implemented in C++ and run on a PC with an Intel Core i3-4160 CPU and 4 GB of RAM under the 64-bit Microsoft Windows 7 operating system. Two real-world datasets, chess [1] and mushroom [1], are used in the experiments. A simulation model [16] was developed to generate the quantities and profit values of items in transactions for all datasets. A log-normal distribution was used to randomly assign quantities in the [1, 5] interval and item profit values in the [1, 1000] interval. In the experiments, all algorithms are run for 10,000 iterations with a population size of 20. Each experiment is executed five times, and the reported results are the averages over these runs. The algorithms are compared in terms of execution time and number of HUIs. Note that the runtime of the designed algorithm includes the construction time of the MP-tree structure. The parameters of the designed algorithm are set as c1 = c2 = 2 and w = 0.9.
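The way quantities and profits are attached to a binary dataset can be sketched in Python as follows. The text specifies only the distribution family and the two intervals, so the log-normal parameters (mu, sigma), the rounding, and the clamping into the interval are assumptions chosen for illustration; the function names are hypothetical as well.

```python
import random


def lognormal_int(rng, low, high, mu, sigma):
    """Draw a log-normal value, round it, and clamp it into [low, high]."""
    value = round(rng.lognormvariate(mu, sigma))
    return max(low, min(high, value))


def attach_utilities(item_transactions, seed=0):
    """Turn lists of items into a quantitative database plus a profit table.

    The mu and sigma values below are illustrative assumptions.
    """
    rng = random.Random(seed)
    items = sorted({i for t in item_transactions for i in t})
    # One unit profit per item, in [1, 1000].
    profit = {i: lognormal_int(rng, 1, 1000, mu=3.0, sigma=1.5) for i in items}
    # One quantity per item occurrence, in [1, 5].
    db = [{i: lognormal_int(rng, 1, 5, mu=0.5, sigma=0.5) for i in t}
          for t in item_transactions]
    return db, profit
```

For example, attach_utilities([["a", "b"], ["b", "c"]]) returns a two-transaction quantitative database together with a profit table for items a, b, and c; fixing the seed keeps the generated values reproducible across runs.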

Runtime
In the runtime experiments on the two datasets, the state-of-the-art evolutionary HUPE_umu-GRAM algorithm is compared to the designed algorithm. Because the TWU model is an algorithm of traditional HUIM, it is not appropriate to compare its runtime with that of the designed algorithm. The results are shown in Fig. 2.
From Fig. 2, it can be seen that the proposed algorithm has a shorter execution time than the HUPE_umu-GRAM algorithm under different minimum utility thresholds. The reason is that the proposed algorithm generates only the valid combinations of itemsets existing in the database, which largely avoids the combinational problem in the evolution process. The HUPE_umu-GRAM algorithm, however, generates many unpromising itemsets, and multiple database scans are required to identify the invalid ones. This process is computationally expensive, and non-HUIs can still be generated from the database. When the minimum utility threshold is set lower, more high transaction-weighted utilization 1-itemsets (1-HTWUIs) are discovered, so more combinations must be examined to find the promising HUIs in the evolution process. Overall, the proposed algorithm handles the combinational problem in the evolution process better than the state-of-the-art HUPE_umu-GRAM algorithm.

Number of HUIs
In this section, the number of HUIs is evaluated to show the performance of the compared algorithms. The state-of-the-art two-phase algorithm [16] is used to discover the actual and complete set of HUIs from the quantitative databases. The results are shown in Fig. 3.
From Fig. 3, it can be seen that the proposed algorithm generates nearly the same number of HUIs as the state-of-the-art two-phase algorithm, especially when the minimum utility threshold is set higher. The reason is that the size of a particle is determined by the number of 1-HTWUIs, so fewer computations are required when the minimum utility threshold is set higher. For dense datasets such as chess and mushroom, the number of 1-HTWUIs is close to the number of discovered HUIs under a higher minimum utility threshold; the number of HUIs generated by the designed algorithm is therefore close to that of the traditional approach for mining the complete set of HUIs. Under a lower minimum utility threshold, the designed algorithm produces fewer HUIs than the TWU model. This is reasonable since the proposed algorithm discovers the HUIs through an evolutionary process; when the minimum utility threshold is set lower, more high transaction-weighted utilization 1-itemsets (1-HTWUIs) are discovered, and it is not easy to produce the same number of HUIs in the evolution process. However, the proposed algorithm outperforms the state-of-the-art HUPE_umu-GRAM algorithm, as can be seen in Fig. 3(a) and Fig. 3(b). This is reasonable since the GA-based algorithm can barely find the promising HUIs, whereas the designed algorithm efficiently finds valid HUIs by considering both the local and global results.

Conclusions
In this paper, we propose a binary particle swarm optimization (BPSO)-based algorithm to efficiently mine high-utility itemsets (HUIs). The contributions of this paper are as follows. First, we adopted the discrete mechanism and set the size of each particle as the number of discovered high-transaction-weighted utilization 1-itemsets (1-HTWUIs) based on the TWU model. This approach can greatly reduce the combinational prob-

Fig. 3: Number of HUIs of the compared algorithms.