SSUP-Growth: A Novel High Utility Itemset Mining Algorithm with a Single Scan of the Database

High Utility Itemset Mining (HUIM) refers to the identification of itemsets of high utility in a transactional database. The UP-Growth algorithm is among the best algorithms for overcoming the candidate generation and repeated database scanning of earlier algorithms. However, it needs to scan the database twice to construct the UP-tree. When existing data is updated with new information, UP-Growth needs to scan both the new and the existing data twice. The main motivation of this work is to develop a new algorithm, the Single-Scan Utility Pattern Tree (SSUP-tree), for mining high utility itemsets from a transaction database with only a single scan of the database. In our algorithm, the details of the high utility itemsets are preserved in a dedicated data structure, the SSUP-tree, after a single scan of the database. Consequently, the identical UP-tree can be retrieved for a given minimum utility threshold. The proposed algorithm needs to scan only the new data to update the SSUP-tree. To evaluate its performance, the SSUP-tree algorithm was implemented and run on synthetic and real datasets. The results show that the SSUP-tree achieves a significant improvement in runtime, since it keeps the details of large databases in a compact format and avoids repeated database scans.


Introduction
The HUIM framework has emerged to address the limitation of Frequent Itemset Mining (FIM), which considers only the frequency of items and disregards their utility, such as profit and quantity. HUIM has many scientific and commercial applications and is a topic of active research [1] [2] [3] [4] [5]. The HUIM framework gives decision makers greater flexibility to reflect the utility of items, which allows more interesting and feasible patterns to be generated. In the HUIM framework, each item has two values that determine its utility: the external utility, which represents the significance of the item as assigned by the user, and the internal utility, which represents the significance of the item within a transaction. Several methods have been applied in previous research [2,4,5,6,8] to enhance the performance of utility mining. Based on the method used, HUIM approaches can be classified as follows: level-wise approaches [1,6,7], tree-based (pattern-growth) approaches [8,9,10,11,2,12], projection-based approaches [4,13], and utility-list-based approaches [3,14,5,15].
Tree-based methods such as Utility Pattern Growth (UP-Growth) [11] seek to overcome the problem of candidate generation. Nonetheless, this algorithm still needs to scan the database twice. Moreover, when the database is updated, UP-Growth [11] needs to re-run the whole process on the new and old data together to update the rules. In other words, it ignores the previous double scan of the database and starts the whole process again. In this paper, we propose a new method, SSUP-Growth, to generate high utility itemsets with only a single scan of the database. When the database is updated, our proposed method needs a single scan of the new data only.
Based on the literature discussed above and to the best of the authors' knowledge, UP-Growth is one of the best approaches for avoiding costly candidate generation. However, it still needs to scan the database twice. Furthermore, it handles a database update by rescanning both the old and the new data twice. This motivates the current work, which creates a novel algorithm that needs a single scan of the database and handles a database update with a single scan of the new data only.
The rest of this paper is structured as follows. In Section 2, we present related work on utility itemset mining. In Section 3, we give the relevant definitions and the problem statement. The proposed algorithm and its data structure are introduced in detail in Section 4. Finally, Section 5 concludes this work.

Problem Definition
Let $I = \{y_1, y_2, \ldots, y_m\}$ be a finite set of distinct items, and let $D = \{T_1, T_2, \ldots, T_n\}$ be a set of transactions, where each transaction $T_i$ ($1 \le i \le n$) is a subset of $I$ and has a unique identifier $i$, known as its Tid. Any subset of $I$ containing $k$ items $\{y_1, y_2, \ldots, y_k\}$, where $y_j \in I$ and $1 \le j \le k$, is called a $k$-itemset. Table 1 provides an example transaction database $D$, and Table 2 lists the external utility (profit) of each item.
Definition 1. The internal utility of an item $y_j$ in a transaction $T_i$, denoted $Q(y_j, T_i)$, is the significance of $y_j$ in $T_i$, such as the quantity recorded in the transaction.
Definition 2. The external utility of an item $y_j$, denoted $P(y_j)$, is the importance of the item as given in the utility table, such as its profit. For example, in Table 2, $P(A) = 5$.
Definition 3. The utility of an item $y_j \in T_i$, denoted $U(y_j, T_i)$, is the product of the external and internal utilities of the item in the transaction: $U(y_j, T_i) = P(y_j) \times Q(y_j, T_i)$. For example, in Table 1, $U(D, T_1) = 2 \times 1 = 2$.
Definition 4. The utility of an itemset $Y$ in a transaction $T_i$, denoted $U(Y, T_i)$, is defined by $U(Y, T_i) = \sum_{y_j \in Y} U(y_j, T_i)$. For example, in Table 1, $U(AC, T_1) = U(A, T_1) + U(C, T_1) = 5 + 1 = 6$.
Definition 5. The utility of an itemset $Y$ in $D$, denoted $U(Y)$, is defined by $U(Y) = \sum_{T_i \in D,\, Y \subseteq T_i} U(Y, T_i)$.
For the illustrative example in Table 1, when β = 20%, the absolute minimum utility value is min_util = 0.20 * 71 = 42.
Definition 7. A high utility itemset (HUI) is an itemset whose utility is greater than or equal to the user-defined minimum utility threshold (min_util). Otherwise, the itemset is a low utility itemset (LUI).
Definition 8. The transaction utility of a transaction $T_i$, denoted $TU(T_i)$, is the total utility of all items it contains: $TU(T_i) = \sum_{y_j \in T_i} U(y_j, T_i)$.
Definition 9. The transaction-weighted utilization of an itemset $Y$, denoted $TWU(Y)$, is the sum of the transaction utilities of all transactions containing $Y$: $TWU(Y) = \sum_{T_i \in D,\, Y \subseteq T_i} TU(T_i)$. If $TWU(Y)$ is not smaller than the minimum utility threshold, $Y$ is considered a high transaction-weighted utilization itemset (HTWUI).
Definition 10. The transaction-weighted downward closure (TWDC for short) property is stated as follows: for any itemset $Y$, if $Y$ is not an HTWUI, then no superset of $Y$ is an HTWUI either. With this property, downward closure is preserved by using the transaction-weighted utilization. For example, in Table 1, since $TWU(AC)$ > min_util, AC is an HTWUI, so its supersets cannot be pruned and remain candidate high utility itemsets.
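The definitions above can be checked on a small example. The following is a minimal Python sketch; the database, the profit table, and the 50% threshold are hypothetical toy values chosen for illustration and are not the paper's Table 1 or Table 2, and all function names are assumptions of this sketch.

```python
# Hypothetical toy data for illustration; these are NOT the paper's Table 1 / Table 2.
PROFIT = {"A": 5, "B": 2, "C": 1, "D": 2}      # external utility P(y)
DB = {                                          # internal utility Q(y, T) per transaction
    "T1": {"A": 1, "C": 1, "D": 1},
    "T2": {"A": 2, "B": 3},
    "T3": {"B": 1, "C": 4, "D": 2},
}

def u_item(y, tid):
    """U(y, T) = P(y) * Q(y, T)  (Definition 3)."""
    return PROFIT[y] * DB[tid][y]

def u_in(itemset, tid):
    """U(Y, T): summed item utilities of Y inside one transaction (Definition 4)."""
    return sum(u_item(y, tid) for y in itemset)

def covering(itemset):
    """Tids of the transactions that contain every item of Y."""
    return [tid for tid, t in DB.items() if set(itemset) <= set(t)]

def u(itemset):
    """U(Y): utility of Y over the whole database (Definition 5)."""
    return sum(u_in(itemset, tid) for tid in covering(itemset))

def tu(tid):
    """TU(T): total utility of one transaction (Definition 8)."""
    return sum(u_item(y, tid) for y in DB[tid])

def twu(itemset):
    """TWU(Y): summed TU of the transactions containing Y (Definition 9)."""
    return sum(tu(tid) for tid in covering(itemset))

total = sum(tu(tid) for tid in DB)      # total utility of the toy database
min_util = 0.5 * total                  # assumed relative threshold of 50%
print(u({"A"}), twu({"A"}), twu({"A", "C"}), min_util)
```

In this toy database, item A has U(A) = 15 but TWU(A) = 24, so with min_util = 17 it is an HTWUI without being an HUI, which illustrates that TWU is an upper bound rather than the exact utility. The itemset AC has TWU(AC) = 8 < 17, so by the TWDC property none of its supersets needs to be examined.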

The Components of SSUP-Tree
In the SSUP-tree, the root node is labelled Null. Every other node N contains five components: N.parent, N.name, N.nu, N.count, and a set of child nodes, where N.parent designates the parent of N, N.name denotes the original item in the transaction database, N.nu is the utility value, and N.count is the support count. Each path in the SSUP-tree represents one or more transactions, and its corresponding utility is taken as the utility of its least high utility item(s). The utility of a given node is greater than the utility of its children or descendants.
A prefix in the SSUP-tree represents an identical pattern shared by several transactions.
Instead of the header table used in UP-Growth, our proposed method uses an Item Utility List (IUL) to record the actual utility value of each item. The IUL contains the names of the items and their estimated utility values. In addition, there is a High Utility Item List (HUIL), which includes only the items with high utility.
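The node layout and the two lists described above could be represented as follows. This is a minimal Python sketch under stated assumptions: the class name `Node`, the helper `high_utility_items`, the use of a dictionary for the child set, and the sample IUL values are all hypothetical and are not taken from the authors' Java implementation.

```python
class Node:
    """One SSUP-tree node with the five components named in the text."""

    def __init__(self, name, parent=None):
        self.name = name          # N.name: item label (None stands for the Null root)
        self.parent = parent      # N.parent
        self.nu = 0               # N.nu: utility value
        self.count = 0            # N.count: support count
        self.children = {}        # child nodes, keyed by item name (a sketch choice)

    def child(self, name):
        """Return the child labelled `name`, creating it on first use."""
        if name not in self.children:
            self.children[name] = Node(name, parent=self)
        return self.children[name]


def high_utility_items(iul, min_util):
    """HUIL: the items of the IUL whose recorded utility reaches min_util."""
    return [item for item, util in iul.items() if util >= min_util]


root = Node(None)                      # root labelled Null
a = root.child("A")                    # creates the node on first access
iul = {"A": 24, "B": 26, "C": 18}      # hypothetical Item Utility List
huil = high_utility_items(iul, 20)     # items kept for min_util = 20
```

Keying the children by item name makes the shared-prefix lookup during insertion a single dictionary access, which is one simple way to realise the prefix sharing described above.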

Building the SSUP-Tree
Our proposed approach constructs a compressed tree structure by exploiting the common prefixes that appear in the transactions and the similarity between transactions. As mentioned above, the primary SSUP-tree contains all of the items that occur in the database. The SSUP-tree is built in two steps (see Algorithm 1):
• In the initial (and only) scan, the transactions are inserted one by one into the empty SSUP-tree according to a predefined item order (such as lexicographical order). Meanwhile, the actual utility value (TWU) of each item is recorded in the IUL.
• The IUL is sorted in ascending order of item utility. Next, the nodes in each path are re-sorted in descending order of their utility values recorded in the IUL. Afterward, the sorted paths are inserted into a new tree (the NSSUP-tree).
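The two steps above can be sketched in Python as follows. This is an illustration under assumptions, not the paper's Algorithm 1: the sketch assumes that each node's N.nu accumulates the transaction utility (TU) of every transaction passing through it, that ties in the IUL are broken lexicographically, and that the toy database and all function names are hypothetical.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.nu = 0          # ASSUMPTION: summed TU of the transactions through this node
        self.count = 0       # number of transactions through this node
        self.children = {}

def insert(root, items, utility, count=1):
    """Insert one ordered path, sharing any existing prefix."""
    node = root
    for name in items:
        if name not in node.children:
            node.children[name] = Node(name, node)
        node = node.children[name]
        node.nu += utility
        node.count += count

def build_primary(db, profit):
    """Step 1: the single scan. Insert in lexicographic order and fill the IUL."""
    root, iul = Node(None), {}
    for t in db:
        tu = sum(profit[y] * q for y, q in t.items())
        for y in t:
            iul[y] = iul.get(y, 0) + tu          # TWU of each item
        insert(root, sorted(t), tu)
    return root, iul

def paths(node, prefix=()):
    """Yield (items, utility, count) for the transactions that END at each node."""
    for child in node.children.values():
        items = prefix + (child.name,)
        kids = child.children.values()
        ending = child.count - sum(c.count for c in kids)
        if ending > 0:
            yield items, child.nu - sum(c.nu for c in kids), ending
        yield from paths(child, items)

def restructure(root, iul):
    """Step 2: re-sort every path by descending IUL utility into a new tree."""
    new_root = Node(None)
    for items, nu, count in paths(root):
        ordered = sorted(items, key=lambda y: (-iul[y], y))
        insert(new_root, ordered, nu, count)
    return new_root

# Hypothetical toy data (not the paper's tables).
profit = {"A": 5, "B": 2, "C": 1, "D": 2}
db = [{"A": 1, "C": 1, "D": 1}, {"A": 2, "B": 3}, {"B": 1, "C": 4, "D": 2}]
primary, iul = build_primary(db, profit)
nssup = restructure(primary, iul)
```

Note that `restructure` reads only the primary tree and the IUL, never the database, so the second step does not count as another scan. In the toy data the item order becomes B, A, C, D, and the two transactions containing B end up sharing the prefix B in the new tree.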

Achieving the UP-tree from the SSUP-tree
Given the characteristics of the SSUP-tree, it is worth mentioning that the SSUP-tree includes both high and low utility items. Thus, for a fixed min_util threshold, the UP-tree is a sub-tree of the SSUP-tree that contains only the high utility items meeting the threshold. This section proposes and analyzes the mechanism for generating the UP-tree from the SSUP-tree.
After the primary SSUP-tree is constructed, the High Utility Item List (HUIL) is obtained directly for a given min_util: all that is needed is to discard the unpromising (low utility) items from the IUL. Afterward, we examine the utility of each node along the paths, from the root to the leaves, and trim the primary SSUP-tree by eliminating the unpromising nodes. Since children and descendants have lower utilities than their parents, an unpromising node is eliminated together with its subtree.
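The trimming step can be illustrated with a short Python sketch. It assumes that the paths are already sorted in descending order of IUL utility, so that every descendant of an unpromising node is unpromising as well; it only prunes the structure and copies the stored node utilities unchanged, which is a simplification of this sketch. The tree contents, IUL values, and function names are hypothetical.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.nu, self.count = 0, 0
        self.children = {}

def insert(root, items, utility):
    """Insert one path whose items are already sorted by descending IUL utility."""
    node = root
    for name in items:
        if name not in node.children:
            node.children[name] = Node(name, node)
        node = node.children[name]
        node.nu += utility
        node.count += 1

def trim(node, iul, min_util, parent=None):
    """Return a pruned COPY of the tree; the primary tree is left untouched.

    A child whose item utility in the IUL is below min_util is dropped together
    with its whole subtree, which is valid because deeper items on a sorted
    path never have a higher IUL utility than the dropped item.
    """
    copy = Node(node.name, parent)
    copy.nu, copy.count = node.nu, node.count
    for name, child in node.children.items():
        if iul[name] >= min_util:
            copy.children[name] = trim(child, iul, min_util, copy)
    return copy

# Hypothetical primary tree and IUL (not the paper's data).
iul = {"B": 26, "A": 24, "C": 18, "D": 18}
primary = Node(None)
insert(primary, ["A", "C", "D"], 8)
insert(primary, ["B", "A"], 16)
insert(primary, ["B", "C", "D"], 10)

min_util = 20
huil = sorted(item for item, util in iul.items() if util >= min_util)
up_like = trim(primary, iul, min_util)
```

Returning a copy rather than pruning in place keeps the primary tree complete, so a later, lower threshold can be served from the same structure without touching the database.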

Updating the SSUP-Tree with New Data
Since the SSUP-tree contains all the items in the database, it can be updated by scanning the new data without scanning the old data again. The update process runs in two steps. First, the new transactions are inserted into the primary SSUP-tree based on the IUL, and the IUL is updated at the same time. Second, the updated SSUP-tree is rebuilt based on the updated IUL. A new transaction containing new items, which do not appear in the existing database, can be handled as a new branch of the main SSUP-tree with its actual utility values.
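The two-step update can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the text: node utilities accumulate transaction utilities, new transactions are ordered by the IUL as it stood before the update, previously unseen items are placed at the end of their path, and the rebuild re-inserts every stored path in the order of the updated IUL. All names and data are hypothetical.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.nu, self.count = 0, 0     # ASSUMPTION: nu sums the TU of passing transactions
        self.children = {}

def insert(root, items, utility, count=1):
    node = root
    for name in items:
        if name not in node.children:
            node.children[name] = Node(name, node)
        node = node.children[name]
        node.nu += utility
        node.count += count

def paths(node, prefix=()):
    """Yield (items, utility, count) for the transactions ending at each node."""
    for child in node.children.values():
        items = prefix + (child.name,)
        kids = child.children.values()
        ending = child.count - sum(c.count for c in kids)
        if ending > 0:
            yield items, child.nu - sum(c.nu for c in kids), ending
        yield from paths(child, items)

def rebuild(root, iul):
    """Re-insert every stored path in descending order of the current IUL."""
    new_root = Node(None)
    for items, nu, count in paths(root):
        insert(new_root, sorted(items, key=lambda y: (-iul[y], y)), nu, count)
    return new_root

def update(root, iul, new_db, profit):
    """Scan ONLY the new transactions, then rebuild from the tree itself."""
    old = dict(iul)                                  # order used while inserting
    for t in new_db:
        tu = sum(profit[y] * q for y, q in t.items())
        for y in t:
            iul[y] = iul.get(y, 0) + tu              # step 1: keep the IUL current
        insert(root, sorted(t, key=lambda y: (-old.get(y, 0), y)), tu)
    return rebuild(root, iul)                        # step 2: rebuild by updated IUL

# Hypothetical existing tree and IUL, then one new transaction with a new item E.
profit = {"A": 5, "B": 2, "C": 1, "D": 2, "E": 10}
iul = {"A": 24, "B": 26, "C": 18, "D": 18}
tree = Node(None)
insert(tree, ["A", "C", "D"], 8)
insert(tree, ["B", "A"], 16)
insert(tree, ["B", "C", "D"], 10)
tree = update(tree, iul, [{"A": 1, "E": 2}], profit)
```

In this toy run the new transaction raises the IUL utility of A above that of B, so the rebuild moves the stored path (B, A) under A; the old transactions themselves are never read again.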

Settings
This section presents our experimental results, which examine the performance of our algorithm and compare it to the UP-Growth algorithm [11]. The experiments were executed on a 2.66 GHz Intel Core 4 Quad processor with 4 GB of memory, running Windows 7. The algorithms were implemented in Java. Their performance was evaluated on synthetic and real datasets, namely the T10I6D100K and chess datasets published in [16]. Both datasets already contained unit profits and purchase quantities. Table 3 shows the properties of the datasets.

Evaluation on Synthetic and Real Datasets
First, we use the synthetic dataset T10I6D100K to show the evaluation results. We employed five minimum utility thresholds: 40%, 50%, 60%, 70%, and 80%. The evaluation results of Phase I (i.e., tree construction) on T10I6D100K are plotted in Figure 1. The results show that our algorithm takes less execution time because it needs only a single scan of the dataset. The execution time of Phase II (i.e., obtaining the corresponding UP-tree) is shown in Figure 2. At the beginning, both algorithms generate the corresponding UP-tree for the first minimum utility threshold (40%). After that, when the minimum utility threshold increases from 40% to 50% or 70%, the corresponding UP-tree can easily be obtained by trimming the UP-tree built for the 40% threshold, with no need to scan the dataset again. However, when the minimum utility threshold decreases, the UP-Growth algorithm ignores its previous double scan and starts the whole process again. As shown in Figure 2, when the min_util threshold decreased from 80% to 70%, UP-Growth spent more time obtaining the corresponding UP-tree. In contrast, our algorithm obtains the corresponding UP-tree by trimming the primary SSUP-tree, regardless of how the minimum utility threshold is modified, without having to re-scan the dataset.
Second, we evaluated the execution time of UP-Growth and SSUP-tree for Phase I and Phase II on the chess dataset. Figure 3 and Figure 4 show the runtime of both algorithms for Phase I and Phase II on the chess dataset using different minimum utility thresholds. Our proposed algorithm clearly spends less time in each phase, for the same reason explained in the first experiment.

Scalability
In this subsection, we evaluate the scalability of our proposed algorithm. The size of the T10I6 dataset was varied to evaluate the scalability of the UP-tree and SSUP-tree algorithms. Figures 5 and 6 show the performance of both methods on different dataset sizes with min_util set to 85%. As shown in Figures 5 and 6, the execution time of the SSUP-tree is lower than that of the UP-tree. As the database size increases, the execution time for identifying high utility itemsets also increases, and the UP-Growth algorithm requires more processing time than the SSUP-tree.

Conclusions
In this paper, we presented an algorithm called SSUP-tree for mining high utility itemsets from transactional databases. We discussed how the SSUP-tree is obtained with a single scan of the database and how it is updated with a single scan of the new data. To evaluate the runtime of the proposed algorithm, the SSUP-tree algorithm and UP-Growth were implemented and run on real and synthetic datasets. The experimental results show that our proposed algorithm, SSUP-tree, consistently outperforms the UP-tree algorithm in terms of runtime, because the SSUP-tree preserves the information of the items in a compact structure and avoids scanning the database again, especially when the database contains a large number of transactions with shared prefixes. However, a limitation of the proposed method is that it is based on a single minimum threshold, which has an adverse effect on the mining results. Therefore, future work will extend this approach to mine high utility itemsets based on multiple minimum thresholds.