GTK: A Hybrid-Search Algorithm of Top-Rank-k Frequent Patterns Based on Greedy Strategy

Abstract: Top-rank-k mining is widely used to find frequent patterns whose rank does not exceed k. Among existing algorithms, level-wise-search can mine all target patterns, but it usually delays the generation of high rank patterns, so the support threshold rises slowly and mining efficiency suffers. To address this problem, this paper proposes GTK, a greedy-strategy-based hybrid mining algorithm for top-rank-k frequent patterns. In GTK, top-rank-k patterns are stored in a static doubly linked list called RSL, and patterns are divided into short patterns and long patterns. Short patterns are generated by a rank-first-search, which always joins the two highest-ranked patterns in RSL that have not yet been joined. Long patterns are then extracted by level-wise-search from the short patterns that satisfy specific conditions. To reduce redundancy, GTK improves the generation method of the subsume index and designs new pruning strategies for candidates, which reduce the amount of computation and improve speed. Real and synthetic datasets are used in experiments to evaluate the proposed algorithm. The experimental results show that GTK has clear advantages in both time efficiency and space efficiency.


Introduction
Data mining supports many aspects of daily life, and some applications [Ruiz and Alisha (2019); Guo, Liu, Ren et al. (2019)] are intuitive examples. Frequent pattern mining, one of the most popular areas of data mining research, was proposed by Agrawal et al. [Agrawal, Imieliński and Swami (1993)]. Its main task is to search a dataset for itemsets, sequences or structures whose support is not less than a user-specified minimum support threshold; among these, frequent itemsets are the most basic frequent patterns. Apriori [Agrawal and Srikant (1994)] was the first frequent pattern mining algorithm and was followed by two other classic algorithms [Ogihara, Zaki, Parthasarathy et al. (1997); Han, Pei, Yin et al. (2004)]. These three are the representative algorithms for horizontal, vertical and trie layouts, respectively. Several other algorithms [Zaki and Gouda (2003); Tsay and Chiang (2005); Vo, Coenen, Le et al. (2013); Vo, Le, Coenen et al. (2016)] also implement this mining task effectively. As research has advanced, frequent pattern mining has spawned a variety of extended algorithms, such as: (1) algorithms for closed frequent pattern mining [Zaki and Hsiao (2002); Wang, Han and Pei (2003); Fang, Wu, Li et al. (2015)]; (2) algorithms for maximal frequent pattern mining [Burdick, Calimlim, Flannick et al. (2005); Zeng, Pei, Wang et al. (2009); Yun and Lee (2016)]; (3) algorithms for high utility frequent pattern mining [Erwin, Gopalan and Achuthan (2007); Hu and Mojsilovic (2007); Yun and Ryang (2015)]; (4) algorithms for erasable frequent pattern mining [Hong, Lin, Lin et al. (2017); Le and Vo (2014); Nguyen, Le, Vo et al. (2015)].

Early frequent pattern mining algorithms usually require the support threshold to be set in advance. Choosing it well requires professional knowledge or experience and is too hard for ordinary users. If the threshold is set too high, the patterns the user wants cannot be fully detected. Conversely, if it is set too low, many useless candidates are generated, which considerably reduces mining efficiency and can even crash the program. To tackle this problem, Han et al. [Han, Wang, Lu et al. (2002)] proposed the top-k closed frequent pattern mining task and designed the TFP (top-k frequent closed patterns) algorithm [Wang, Han, Lu et al. (2005)]. Although min_sup is not used in this algorithm, its core parameter min_l is as difficult to set in advance as the traditional min_sup. In addition, because different patterns may share the same support, the final mining results may omit patterns that are important to users. Aiming at these two major defects of TFP, Deng et al. [Deng and Fang (2007)] introduced the concept of top-rank-k frequent patterns, for which neither min_sup nor min_l needs to be predetermined.

How to mine top-rank-k frequent patterns efficiently has received extensive attention in industry and academia. In recent years, numerous improved algorithms [Fang and Deng (2008); Deng (2014); Huynh, ; Dam, Li, Fournier et al. (2016); Wang, Ren, N Davis et al. (2017); Jia, Xiang and Liu (2018)] have been produced. The FAE algorithm [Deng and Fang (2007)] adopts the horizontal data layout, uses heuristic rules and effective pruning strategies to reduce the mining space, and retains useful patterns for the expansion of long patterns. The VTK (Vertical Mining of top-rank-k Frequent Patterns) algorithm [Fang and Deng (2008)] follows the level-wise-search method of FAE. To avoid the many dataset scans that FAE requires, VTK adopts a vertical data layout, so that pattern information is represented by a Tid-list and the support of each pattern is obtained as the length of its Tid-list. VTK achieves better results than FAE. However, VTK may suffer performance degradation on dense databases.
To this end, Deng proposed the NTK (Fast mining top-rank-k frequent patterns by using Node-lists) algorithm [Deng (2014)], which compresses the dataset into a PPC-tree, a structure similar to the FP-Tree, and mines top-rank-k frequent patterns level by level from the Node-lists of 1-patterns [Deng and Wang (2010)]. Despite the improvement over FAE and VTK, NTK has no strategy to narrow the search range. iNTK (top-rank-k frequent patterns mining algorithm based on subsume index and N-list) [Huynh, ] is an optimized algorithm based on NTK whose core structure is the N-list [Deng, Wang and Jiang (2012)]. An N-list is composed of prefix nodes and is always shorter than the Node-list, which is composed of suffix nodes, so it takes less memory. Moreover, iNTK reduces the scope of pattern mining and speeds up the mining process by introducing the subsume index [Song, Yang and Xu (2008)]. iNTK was shown to outperform NTK, and its advantage grows as k increases. Although iNTK improves efficiency, it leaves some issues unresolved. In response, BTK (top-rank-k frequent patterns mining algorithm based on TB-tree) [Dam, Li, Fournier et al. (2016)] offers the following solutions: (1) It proposes a TB-tree and B-lists, in which each node records its start-build and finish-build codes, to avoid the time-consuming construction of the PPC-tree. (2) To avoid useless operations in iNTK, patterns containing subsume indexes are combined only after the final candidates are found. BTK also designs an EP (early pruning by threshold) strategy and an RSC (raising threshold by the support of candidates) strategy based on the B-list. An extensive experimental study has shown that BTK significantly outperforms iNTK overall [Dam, Li, Fournier et al. (2016)].
In summary, although a variety of algorithms have been proposed for fast mining of top-rank-k frequent patterns, they are not yet efficient enough, and without a high-performance mining method it is difficult to save runtime and memory. This paper therefore proposes an algorithm called GTK. GTK and BTK are compared through numerous experiments, and the experimental results show clear advantages of GTK in both time efficiency and space efficiency. The main contributions of this paper are as follows:
(1) To address the problem that the B-list takes up a lot of space and its intersection function takes a long time, an FPI (frequent pattern information) class is designed to represent pattern information with a vertical data structure. The FPI of a pattern includes three parts: the pattern's items (Its), the subsume indexes of the items (Si), and the bitset of the pattern's Tids (Bs). The size of the bitset (the number of Tids it contains) is the support of the pattern. The Tids information represented by the bitset is greatly compressed, which saves storage space. During pattern mining, a bitwise AND on bitsets replaces the more complex intersection function of B-lists.
(2) To reduce the high maintenance cost of the top-rank-k table structure, a static doubly linked list named RSL (Static Doubly-Linked Lists of top-rank-k) is designed to store the top-rank-k frequent patterns. All nodes in the RSL are kept in descending order of support. Because RSL is a linear table backed by an array, inserting or deleting a node only requires modifying cursor fields instead of moving large numbers of elements, which reduces both resource consumption and time cost.
(3) To address the time cost of level-wise-search, a hybrid mining algorithm is designed. According to a length threshold, patterns are divided into short patterns and long patterns. Short patterns are mined by a rank-first-search based on the greedy strategy. Long patterns are then generated by level-wise-search, taking as input all short patterns whose length equals the length threshold. This design promotes the fast generation of high rank patterns and avoids a large number of invalid joins. Related strategies are also designed to prune candidates.
(4) To address the inefficient use of computing resources when generating the subsume index, an optimization is proposed: the search for subsume indexes is embedded in the rank-first-search, so that qualified subsume indexes are found while mining the 2-patterns. This avoids scanning the 1-patterns repeatedly and makes full use of computing resources. The method is efficient on both sparse and dense datasets.

The rest of this paper is organized as follows. Section 2 gives the basic concepts, highlights the problems of the current mainstream algorithms and analyzes them. Section 3 describes the construction and initialization of RSL. Section 4 details the design and implementation of the GTK algorithm, including the Ex_Short method, the Ex_Long method, the related pruning strategies and a simple example. Section 5 presents the experimental design, results and analysis. Section 6 concludes the paper and discusses directions for future research.

Basic definitions and problem analysis

Problem of mining top-rank-k frequent patterns
Related definitions of frequent patterns can be found in Agrawal et al. [Agrawal and Srikant (1994)]. The basic concepts and the problem of mining top-rank-k frequent patterns, as given in the literature [7], are described below.

The Rank of a Pattern. Given a transaction database D and a pattern A (A ⊆ I), the rank of A, denoted R_A, is defined as

$R_A = |\{X.Sup \mid X \subseteq I \text{ and } X.Sup \ge A.Sup\}|$ (1)

where |Y| is the number of elements in the set Y.

Top-rank-k Frequent Patterns. Given a transaction database D and a threshold k, a pattern A (A ⊆ I) is referred to as a top-rank-k frequent pattern if and only if R_A is not greater than k, that is, R_A ≤ k.

Top-rank-k Frequent Patterns Mining. Given a transaction database D and a threshold k, top-rank-k frequent pattern mining is the task of finding the complete set of frequent patterns whose ranks are not greater than k. This set is denoted S_top-k, and the minimum support among its patterns is the support threshold, denoted S_t in this paper.
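The rank definition in Eq. (1) can be illustrated with a small brute-force sketch. It is only meant to make the definitions concrete, not to be efficient: the function names `supports` and `top_rank_k` and the five-transaction database are assumptions introduced here, and patterns with zero support are ignored.

```python
from itertools import combinations

def supports(transactions):
    """Brute-force support of every itemset that occurs at least once."""
    items = sorted(set().union(*transactions))
    sup = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = sum(1 for t in transactions if set(combo) <= t)
            if s > 0:
                sup[frozenset(combo)] = s
    return sup

def top_rank_k(transactions, k):
    """Return (top-rank-k patterns with their supports, threshold S_t)."""
    sup = supports(transactions)
    # Distinct support values in descending order; the rank of a pattern is
    # the number of distinct supports that are >= its own support (Eq. 1).
    distinct = sorted(set(sup.values()), reverse=True)
    rank = {p: distinct.index(s) + 1 for p, s in sup.items()}
    s_t = distinct[min(k, len(distinct)) - 1]
    return {p: s for p, s in sup.items() if rank[p] <= k}, s_t

# Hypothetical database, not taken from the paper's tables.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
patterns, s_t = top_rank_k(T, k=2)
```

With k = 2, this toy database yields [a] at rank 1 and [b], [c] tied at rank 2, so three patterns are returned even though k is 2; this is the behavior that distinguishes top-rank-k from top-k.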

The subsume indexes of frequent 1-patterns
The subsume index was proposed by Song et al. [Song, Yang and Xu (2008)] to reduce the search scope in the pattern mining process. It is defined as follows: the subsume index of a pattern X (the representative item [Song, Yang and Xu (2008)]), denoted Subsume(X), is an itemset such that, under some order " ≺ " (e.g., lexicographic order), if Y ∈ Subsume(X), then the Tids of X are a subset of the Tids of Y. Consequently, the support of the union of X with any non-empty subset of Subsume(X) is equal to Sup(X). Conversely, if Y is a 1-pattern and the support of the union of Y with X equals Sup(X), then Y ∈ Subsume(X). Example 2. In Tab. 2, g([d])=Tids{1, 3}, g([b])=Tids{1, 2, 3, 4, 5, 6}, then it is easy to find that |g ([d, b]
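The Tid-containment condition in this definition can be checked directly with sets. The following is a minimal sketch under stated assumptions: the six-transaction database is hypothetical (Tab. 2 is not reproduced here) and was only arranged so that g([d]) = {1, 3} and g([b]) = {1, ..., 6} as in Example 2; the helper names `tids` and `subsume` are introduced for illustration, and the ordering " ≺ " is ignored for simplicity.

```python
def tids(transactions, itemset):
    """g(itemset): the 1-based ids of the transactions containing the itemset."""
    return {i for i, t in enumerate(transactions, start=1) if itemset <= t}

def subsume(transactions, x, items):
    """Items y whose Tid set contains the Tid set of the single item x."""
    gx = tids(transactions, {x})
    return {y for y in items if y != x and gx <= tids(transactions, {y})}

# Hypothetical database consistent with g([d]) and g([b]) in Example 2.
T = [
    {"a", "b", "d"},  # Tid 1
    {"b", "c"},       # Tid 2
    {"b", "d"},       # Tid 3
    {"a", "b"},       # Tid 4
    {"b", "c"},       # Tid 5
    {"a", "b", "c"},  # Tid 6
]
items = {"a", "b", "c", "d"}
sub_d = subsume(T, "d", items)
```

Here `sub_d` contains only `b`, and the support of [d, b] equals the support of [d], which is the property the mining algorithms exploit to skip explicit joins.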

Problem analysis
This paper argues that S_t is the determining factor for the top-rank-k frequent patterns. In the mainstream algorithms, S_t rises dynamically during mining until its final value is reached, and the speed of this process directly affects the performance of the algorithm. The key to improving the mining efficiency of top-rank-k frequent patterns is therefore to accelerate the rise of S_t. Proceeding from this view, this paper proposes an algorithm called GTK, which focuses on accelerating the rise of S_t so that the final S_t is found as early as possible. The study shows that when k is constant and within a reasonable range, the speed of mining high rank patterns is positively correlated with the speed at which S_t rises. Therefore, to improve mining efficiency, algorithms should speed up the generation of high rank patterns.

Fast generation of high rank patterns is a challenging task that almost all traditional algorithms avoid. To mine top-rank-k frequent patterns, traditional algorithms generally adopt the level-wise-search, which obtains the (t+1)-patterns by joining t-patterns. This method is not efficient enough. A simple example with k equal to 5 is shown in Fig. 1, in which the horizontal number is the pattern's rank and the gray area marks the top-rank-k frequent patterns. As shown, the level-wise-search performs 8 joins in total, and the final S_t is determined only after 7 joins. To break through the constraint of level-wise-search, this paper proposes the rank-first-search method based on the greedy strategy. Fig. 2 shows the process of rank-first-search with the same dataset and k.
[AB], with a rank of 3, is generated first, and the final S_t is then determined after only 3 joins. As a result, high rank patterns such as {[AB], [AC], [BC]} are generated quickly. Moreover, thanks to the fast-rising S_t, [D] is removed from the frequent patterns and cannot be joined with any other pattern. Consequently, 3 useless joins are avoided; on the experimental datasets this saving may reach thousands or even tens of thousands of joins. In this example, the computational cost of rank-first-search is less than 1/2 of that of level-wise-search, which supports the argument presented above.

Basic definitions and properties
Definition 1 (FPI of a 1-pattern). Given a transaction dataset D and a pattern X, the class consisting of (X.Its, X.Bs, X.Si) is called the FPI of X and is denoted as FX. Its is the set of items, Bs is the bitset of the Tids of X, whose size equals the support of X (X.Sup), and Si is the set of X's subsume indexes (described in the next chapter).

The following conclusion can be drawn from Inference 1: compared with joining a pattern with a low rank pattern, joining that pattern with a high rank pattern has a better chance of generating a high rank pattern. This conclusion provides stronger theoretical support for the analysis in Section 2. It is also the most direct basis for adopting rank-first-search, which always joins the two patterns with the highest rank to promote the early generation of high rank patterns.
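Definition 1 can be sketched as a small class. This is an illustration only, assuming that a Python integer is used as the bitset (bit i is set when transaction i contains every item of the pattern); the names `FPI`, `make_fpi` and `join` and the toy transactions are introduced here and are not taken from the authors' Scala implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FPI:
    its: frozenset               # Its: the items of the pattern
    bs: int                      # Bs: bitset of Tids, stored as an integer mask
    si: frozenset = frozenset()  # Si: subsume indexes, empty until computed

    @property
    def sup(self):
        # The support is the number of set bits in Bs.
        return bin(self.bs).count("1")

def make_fpi(item, transactions):
    """Build the FPI of a 1-pattern by scanning the transactions once."""
    bs = 0
    for i, t in enumerate(transactions):
        if item in t:
            bs |= 1 << i
    return FPI(frozenset([item]), bs)

def join(x, y):
    """Join two FPIs: union of the items, bitwise AND of the Tid bitsets."""
    return FPI(x.its | y.its, x.bs & y.bs)

# Hypothetical transactions for illustration.
T = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"b", "c"}]
fa, fb = make_fpi("a", T), make_fpi("b", T)
fab = join(fa, fb)
```

The join reduces Tid-set intersection to a single AND over machine words, which is the reason the paper gives for preferring bitsets to B-list intersection.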

The generation method of subsume indexes
Unlike other patterns, which must be produced by a join operation, patterns that contain a representative item and have the same support as that item can be generated directly by combining the representative item with the subsets of its subsume index. To find the subsume indexes, BTK checks the definition of the subsume index in Section 2.2 from the opposite direction while traversing the 1-pattern list. However, when the dataset has few subsume indexes, this step contributes little to the mining progress. Moreover, the 1-pattern list is traversed again when the Candidate_gen method is called; this repeated traversal is unnecessary and slows mining down. The BTK method therefore does not make full use of computing resources and is too time-consuming. This paper gives an optimized subsume index generation method called Gen_Subsume, which differs in two respects:
(1) The generation of subsume indexes is embedded in candidate generation instead of being performed before it, which saves time. When 2-patterns are mined during candidate generation, a 2-pattern X is generated by joining two 1-patterns. According to Property 1, if the Bs size of F X is equal to the support of one of these two patterns, then the other pattern is recorded as a subsume index. Therefore, there is no need to scan the 1-pattern list repeatedly. The advantage of Gen_Subsume is that a single scan of the 1-pattern list finds the subsume indexes and also yields all frequent 2-patterns.
(2) The Bs of the FPI is used for bitwise AND operations instead of comparing the B-info-codes of B-lists, for higher efficiency. Gen_Subsume uses bitsets to reduce the storage space needed for pattern information and to take advantage of bit-level parallelism in hardware.

After Gen_Subsume, the union of the 1-patterns and 2-patterns, including their subsume indexes, is obtained and used to prepare for RSL initialization. Details of Gen_Subsume are presented in Fig. 3. Initially, the Si of each FPI is the empty set and S_t is set to zero. C1 (the 1-patterns) is obtained and arranged in descending order of support. If the length of C1 is greater than k, S_t is updated, which produces a set of frequent 1-patterns L1 and a set of supports Lsup. Then each pattern of L1, from the second to the last, is joined one by one with every pattern in front of it, and all 2-patterns generated by this pattern are stored in Ltmp. Once the subsume indexes of the pattern are found, the Si of each pattern in Ltmp is updated. During this process, S_t keeps rising.
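The two ideas above (finding subsume indexes while generating 2-patterns, and using bitwise AND) can be sketched as follows. This is a simplified reading of the prose, not a transcription of Fig. 3: the function name `gen_subsume` is hypothetical, the initial S_t is taken as the k-th distinct support of the 1-patterns (an assumption made here), and S_t is kept fixed during the scan rather than raised dynamically.

```python
def gen_subsume(transactions, k):
    """Find subsume indexes of 1-patterns while generating 2-patterns."""
    def sup(mask):
        return bin(mask).count("1")

    # Vertical layout: one integer bitmask of Tids per item.
    masks = {}
    for tid, t in enumerate(transactions):
        for item in t:
            masks[item] = masks.get(item, 0) | (1 << tid)

    # C1 in descending order of support (ties broken by item name).
    c1 = sorted(masks, key=lambda it: (-sup(masks[it]), it))
    distinct = sorted({sup(masks[it]) for it in c1}, reverse=True)
    s_t = distinct[k - 1] if len(distinct) >= k else 0
    l1 = [it for it in c1 if sup(masks[it]) >= s_t]

    si = {it: set() for it in l1}   # subsume index of each 1-pattern
    two = {}                        # frequent 2-patterns and their supports
    for j in range(1, len(l1)):
        y = l1[j]
        for i in range(j):          # join y with every pattern in front of it
            x = l1[i]
            s = sup(masks[x] & masks[y])
            if s == sup(masks[y]):
                si[y].add(x)        # every transaction with y also contains x
            if s > 0 and s >= s_t:
                two[frozenset((x, y))] = s
    return l1, si, two, s_t

# Hypothetical database for illustration.
T = [{"a", "b", "d"}, {"b", "c"}, {"b", "d"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
l1, si, two, s_t = gen_subsume(T, k=2)
```

A single pass over the ordered 1-pattern list produces both the subsume indexes and the frequent 2-patterns, so no second traversal is needed.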

Structure of RSL
The algorithms based on the PPC-tree or TB-tree structure use a table structure called Tabk to store the top-rank-k frequent patterns. Tabk has a fixed number of entries, usually k, and each entry contains all patterns with the same rank. During the mining proce-

Definition 3 (Rnode). An Rnode, the node of RSL, is composed of two parts, the cursor domain and the data domain. The cursor domain contains a link back to the previous node (prev) and a link to the next node (next). The data domain contains a support (Sup) and an FS, the set of FPIs of the patterns whose support equals Sup. The structure of an Rnode is shown in Fig. 4.

Definition 4 (RSL). RSL, a static doubly-linked list backed by an array with a fixed length of k+1, is made up of a head node and k Rnodes. The head node only points to the Rnode of highest rank and stores no pattern information. All Rnodes of RSL are arranged from the highest rank to the lowest. RSL combines the advantages of a sequential storage structure and a linked storage structure: insertion and deletion only need to modify the cursors of Rnodes, and no element has to be moved. The cost of insert and delete operations is therefore lower than in Tabk.
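Definitions 3 and 4 can be illustrated with an array-backed list. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, patterns are stored as plain values instead of FPIs, and the policy of recycling the tail slot when the list is full follows the description of RSL given later in the experimental discussion.

```python
class RSL:
    """Static doubly linked list over fixed arrays; slot 0 is the head node."""

    def __init__(self, k):
        self.k = k
        n = k + 1
        self.prev = [0] * n               # cursor to the previous slot
        self.next = [0] * n               # cursor to the next slot (0 = head)
        self.sup = [0] * n                # Sup of each Rnode
        self.fs = [[] for _ in range(n)]  # FS: patterns sharing that support
        self.used = 0                     # number of Rnodes in use

    def threshold(self):
        """Support of the tail Rnode once all k Rnodes are in use, else 0."""
        return self.sup[self.prev[0]] if self.used == self.k else 0

    def insert(self, sup, pattern):
        # Find the first Rnode whose support is not greater than sup.
        cur = self.next[0]
        while cur != 0 and self.sup[cur] > sup:
            cur = self.next[cur]
        if cur != 0 and self.sup[cur] == sup:
            self.fs[cur].append(pattern)  # same rank: extend the FS
            return True
        if self.used < self.k:
            self.used += 1
            slot = self.used              # take the next free array slot
        else:
            if cur == 0:                  # lower than every stored support
                return False
            tail = self.prev[0]           # recycle the tail Rnode's slot
            self.next[self.prev[tail]] = 0
            self.prev[0] = self.prev[tail]
            slot = tail
            if cur == tail:
                cur = 0                   # new node becomes the new tail
        self.sup[slot], self.fs[slot] = sup, [pattern]
        before = self.prev[cur]           # only cursors change, no data moves
        self.prev[slot], self.next[slot] = before, cur
        self.next[before] = slot
        self.prev[cur] = slot
        return True

    def nodes(self):
        cur = self.next[0]
        while cur != 0:
            yield self.sup[cur], list(self.fs[cur])
            cur = self.next[cur]

rsl = RSL(3)
for s, p in [(5, "a"), (3, "b"), (5, "c"), (4, "d"), (6, "e"), (2, "f")]:
    rsl.insert(s, p)
```

After these insertions the list holds supports 6, 5 and 4; the Rnode with support 3 was overwritten in place when support 6 arrived, and support 2 was rejected because it would rank below k.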

RSL initialization
The construction of RSL is implemented by the Append function and the Sort function. The initialization process of RSL is shown in Fig. 5, where S is used to store the Sup of each Rnode and L is used to record the number of nonempty Rnodes of the RSL. Taking k=5 and the LU shown in Tab. 4 as an example, the RSL after initialization is shown in Tab. 5.

Proof. When no two top-rank-k frequent patterns share the same rank, there are only k top-rank-k frequent patterns, that is, only one pattern per rank. In this case the number of top-rank-k frequent patterns is the smallest, and so is the number of items that make up the top-rank-k frequent pattern set. In the most ideal case, the top-rank-k frequent patterns with threshold k are composed of the N items with the highest support, and the supports of those patterns are all different. It can be seen from the above that in the most ideal case only N items are required to obtain the top-rank-k frequent patterns, and the maximum length of the top-rank-k frequent patterns equals N. In general, however, more than N items are used to make up the top-rank-k frequent patterns, so the maximum length of a top-rank-k frequent pattern will not be greater than N.

According to Theorem 1, this paper divides the patterns of different lengths into short patterns and long patterns, defined as follows:

Definition 4. Let p be a pattern of length l and let η be an integer length threshold with η= ⌊ log 2 (k+1) /2⌋ (η is at least 3). If l≤η, then p is called a short pattern; otherwise p is a long pattern.

The previous problem analysis explained the defect of level-wise-search by example. This chapter therefore focuses on the fast generation of high rank patterns and proposes a mixed search method that treats the short patterns and long patterns of top-rank-k frequent patterns separately.
In what follows, Section 4.1 describes how the Ex_Short method mines short top-rank-k frequent patterns by rank-first-search; Section 4.2 introduces the details of the GTK algorithm, including the long pattern mining method Ex_Long; and Section 4.3 gives an example for a better understanding of GTK.
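The length threshold can be computed with integer arithmetic. The sketch below assumes the formula is read as η = ⌊log2(k+1) / 2⌋; the function names are hypothetical, and the stated lower bound of 3 is treated as a condition on k rather than enforced in code.

```python
def length_threshold(k):
    """eta = floor(log2(k + 1) / 2), computed without floating point."""
    # (k + 1).bit_length() - 1 equals floor(log2(k + 1)) for k >= 0.
    return ((k + 1).bit_length() - 1) // 2

def is_short(pattern_length, k):
    """A pattern is short when its length does not exceed eta."""
    return pattern_length <= length_threshold(k)

etas = {k: length_threshold(k) for k in (63, 255, 1023)}
```

Using `bit_length` avoids rounding problems that a floating-point logarithm could introduce when k+1 is an exact power of two.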

Ex_Short method
The Ex_Short method ignores the limit of pattern length and takes advantage of rank-first-search to promote the fast generation of high rank patterns. To accelerate the Ex_Short process, this paper also designs a PBJ strategy and a CC strategy.

PBJ strategy (pruning before joining)
The main role of the PBJ strategy is to define the basic conditions that FPIs must meet in the Join operation. Let the FPI of pattern X be FX, the FPI of pattern Y be FY, and Y.Sup>X.Sup. Firstly, since the 2-patterns have already been obtained while looking for the subsume indexes of 1-patterns, 2-patterns are no longer mined in the Ex_Short process, so |FX.Its| and |FY.Its| cannot both be equal to 1. Secondly, the length of the new pattern cannot be greater than η; since X and Y are subsets of the new pattern, |FX.Its| and |FY.Its| should both be less than η. Finally, because of the subsume index, to avoid useless Join operations, X and Y should satisfy the following:  F[a]). Thus, when performing Join, it is necessary to check promptly whether the new pattern already exists, to avoid double counting. To this end, the candidate pattern set Ci is set up to store all patterns of the same length, where i>2 and i is the length of the pattern. The significance of the CC strategy is therefore: when a pattern X is found, it should be filtered out immediately if it is already in C|X|.

Ex_Short method (extracting short patterns)
The GTK algorithm produces the short top-rank-k frequent patterns through the Ex_Short method shown in Fig. 6. This procedure loops over each Rnode of which the support is  PBJ strategy is met (line 6), the Join function is called (line 7), and then the candidate is checked by the CC strategy; if it has a greater support than S_t, it is appended to RSL and to the candidate pattern set (lines 1-10). When two Rnodes are continuous (line 9), each FPI in the FS of the latter Rnode is joined by the same procedure (lines 10-16). At this point, all short top-rank-k frequent patterns with length no greater than η are stored in the RSL in descending order, and S_t has been raised rapidly.
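The greedy idea behind Ex_Short can be sketched in a much simplified form. The code below is not the procedure of Fig. 6: it replaces RSL with a plain Python list kept in descending order of support, omits the subsume-index conditions of the PBJ strategy, and also generates 2-patterns by joining. The function name `rank_first_search` and the example data are assumptions. What it does keep is the core behavior: patterns are visited from the highest support downward, each is joined with every pattern ranked in front of it, duplicates are filtered in the spirit of the CC strategy, and S_t is raised after every successful insertion.

```python
def rank_first_search(transactions, k, eta):
    """Simplified greedy search for top-rank-k patterns of length <= eta."""
    def sup(mask):
        return bin(mask).count("1")

    masks = {}
    for tid, t in enumerate(transactions):
        for item in t:
            masks[item] = masks.get(item, 0) | (1 << tid)

    # Working list of (support, itemset, Tid mask) in descending support.
    work = sorted(((sup(m), frozenset([it]), m) for it, m in masks.items()),
                  key=lambda e: (-e[0], sorted(e[1])))
    seen = {e[1] for e in work}          # candidates already generated

    def prune():
        # S_t is the k-th distinct support found so far (1 if fewer than k).
        distinct = sorted({e[0] for e in work}, reverse=True)
        s_t = distinct[k - 1] if len(distinct) >= k else 1
        while work and work[-1][0] < s_t:
            work.pop()                   # drop patterns now ranked below k
        return s_t

    s_t = prune()
    joins = 0
    i = 1
    while i < len(work):
        _, x, mx = work[i]
        for j in range(i):               # join with every higher-ranked pattern
            _, y, my = work[j]
            cand = x | y
            if len(cand) > eta or cand in seen:
                continue
            seen.add(cand)
            joins += 1
            m = mx & my
            s = sup(m)
            if s >= s_t:
                pos = len(work)          # insert behind equal or higher supports
                while work[pos - 1][0] < s:
                    pos -= 1
                work.insert(pos, (s, cand, m))
                s_t = prune()
        i += 1
    return {p: s for s, p, _ in work}, s_t, joins

# Hypothetical database for illustration.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
found, s_t, joins = rank_first_search(T, k=3, eta=3)
```

Because a joined pattern never has a higher support than either of its parents, every new pattern is inserted behind the current position and is visited later, so the list can grow safely while it is being traversed.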

GTK algorithm
Since the GTK algorithm ignores the pattern length and focuses only on the highest rank patterns, it has great advantages in mining short patterns in the early stage of the algorithm. However, as the number of patterns grows, relatively fewer FPIs conform to the PBJ strategy, so traversing the Rnodes one by one to find the long patterns becomes more and more time-consuming. To solve this problem, after the Ex_Short procedure, GTK applies level-wise-search to the short patterns of length η to mine the long frequent patterns.
Because there are few long top-rank-k frequent patterns and matching eligible short patterns is cheap, the hybrid mining adopted by GTK is more effective than using level-wise-search or rank-first-search alone. The GTK algorithm is shown in Fig. 7. It first mines the short top-rank-k frequent patterns (lines 1-4), then extracts the patterns of length η from the RSL to construct the set Lη and arranges Lη in descending order of support (line 5). After that, the Ex_Long method is called to mine the long top-rank-k frequent patterns (line 6). Finally, each pattern with subsume indexes is combined with the non-empty subsets of its subsume indexes to obtain the complete set of top-rank-k frequent patterns (lines 7-12).
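The final combination step can be sketched with `itertools.combinations`. The function name `expand_with_subsume` and the example values are hypothetical; the sketch relies only on the property stated earlier, namely that the union of a pattern with any non-empty subset of its subsume index has the same support as the pattern.

```python
from itertools import combinations

def expand_with_subsume(pattern, support, subsume_index):
    """Patterns obtained by adding each non-empty subset of the subsume index."""
    extra = sorted(set(subsume_index) - set(pattern))
    out = {}
    for r in range(1, len(extra) + 1):
        for combo in combinations(extra, r):
            # No join is needed: the support is inherited from the pattern.
            out[frozenset(pattern) | frozenset(combo)] = support
    return out

expanded = expand_with_subsume({"d"}, 2, {"a", "b"})
```

A subsume index with n items yields 2^n - 1 additional patterns, all produced without touching any Tid bitset.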
Ex_Long is a level-wise-search method that uses a loop to explore patterns of increasing length until no candidate can be generated. While Li is not empty, a new pattern set Li+1 is created to store the patterns of length i+1 (lines 2-3), and each pattern in Li is joined with every pattern in front of it. In this process, the subscript in RSL of the latter pattern is obtained first (lines 7-9); after that, the candidate is generated and inserted into RSL, and Li+1 is updated (lines 10-26). Line 27 is the beginning of the next cycle. When the number of patterns in Li does not exceed 1, the mining process ends.
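The level-wise loop can be sketched as follows. This is a simplified illustration, not the procedure referenced by the line numbers above: the function name `ex_long` is hypothetical, the threshold S_t is passed in as a fixed value instead of being raised through RSL, and two patterns are joined whenever their union is exactly one item longer, which is an assumption about the join condition.

```python
def ex_long(level, s_t):
    """Level-wise search starting from patterns that all share one length.

    level maps each itemset (frozenset) to its Tid bitmask (int).
    """
    def sup(mask):
        return bin(mask).count("1")

    found = {}
    while len(level) > 1:                 # stop when at most one pattern is left
        pats = list(level.items())
        nxt = {}                          # L_{i+1}
        for j in range(1, len(pats)):
            for i in range(j):            # join with every pattern in front
                cand = pats[i][0] | pats[j][0]
                if len(cand) != len(pats[j][0]) + 1 or cand in nxt:
                    continue
                m = pats[i][1] & pats[j][1]
                if sup(m) >= s_t:
                    nxt[cand] = m
        found.update(nxt)
        level = nxt
    return {p: sup(m) for p, m in found.items()}

# Hypothetical 2-patterns over five transactions (bit i = transaction i).
L2 = {
    frozenset({"a", "b"}): 0b00111,
    frozenset({"a", "c"}): 0b01011,
    frozenset({"b", "c"}): 0b10011,
}
long_patterns = ex_long(L2, s_t=2)
```

In this toy input the only long pattern is [a, b, c] with support 2, and the loop stops as soon as the next level holds a single pattern.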

Illustration
When mining top-rank-k frequent patterns on conventional datasets, η is assumed to be not less than 3 in this paper, so k is greater than or equal to 64. Such a large k makes a worked example impractical. Since the principle is the same for smaller values, this paper lets η=2 and k=5 and takes the initialized RSLe in Tab. 6 as an example. The RSLe after the Ex_Short procedure is presented in Tab. 7, and the final Stop-k is shown in Tab. 8.

Experimental results
To accurately evaluate the performance of GTK, this paper adopts three methods, with the purpose of:
• Determining the reliability of the subsume index generation method proposed in this paper.
• Verifying the validity of the rank-first-search based on the greedy strategy.
• Evaluating the comprehensiveness and efficiency of the GTK algorithm in terms of time and memory usage.
For this purpose, six datasets with different characteristics, namely Chess, Connect, Mushroom, Pumsb, Retail and T10I4D100K, and two synthetic datasets, Test990.99KD1 and Test2K50KD1, generated by the LUCS-KDD data generator, are selected. Tab. 9 shows the characteristics of these datasets, including the numbers of items (num_Items) and transactions (num_Trans). The tests are run on a laptop with an Intel® Core™ 3.0 GHz CPU and 12 GB of memory. All programs are implemented in Scala with IntelliJ IDEA 2018 on the Windows 10 operating system.

The time of subsume indexes generation
Tab. 10 shows the time that the BTK and GTK algorithms take to obtain the subsume indexes of 1-patterns. It can be clearly seen that BTK needs more time to generate subsume indexes than GTK. There are three reasons for this:
(1) BTK uses the B-lists of all 1-patterns to search for subsume indexes, but only a fraction of the 1-patterns can become top-rank-k frequent patterns. The large search range therefore increases the calculation time, and the effect may be even worse on sparse datasets with a large number of 1-patterns, such as Retail and Test2K50KD1. GTK filters the 1-patterns in advance, which reduces the time consumption.
(2) The generation of subsume indexes in BTK requires frequent calls to the checkSubsume function. This is time-consuming for B-lists that contain a large number of B-info-codes. GTK uses the bitwise AND operation to reduce computational complexity and speed up the generation of subsume indexes.
(3) In this paper, subsume indexes are generated while the frequent 2-patterns are obtained. In this process, as the threshold continues to increase, some 1-patterns can no longer be joined, so the search range for subsume indexes becomes smaller and smaller.

In addition, the specific values of the experiment show that GTK obtains subsume indexes about 10 times faster than BTK on both sparse and dense datasets. On Pumsb, Test990.99KD1 and Test2K50KD1 in particular, the speed gap reaches 100 times. Thus, the reliability of the subsume index generation method proposed in this paper is confirmed.

Efficiency of the hybrid-search
The BTK algorithm is based on level-wise-search, while GTK is implemented by hybrid-search. Hybrid-search consists of two parts: rank-first-search is used first to find short frequent patterns, and level-wise-search is then used to mine long frequent patterns. Because a high threshold avoids numerous useless joins and reduces the amount of calculation, the purpose of hybrid-search is to make the support threshold rise faster so that the final threshold can be determined quickly.
Tab. 11 shows the number of candidate patterns generated by the two algorithms during the mining process. As shown in the table, GTK always generates fewer candidate patterns than BTK, and the gap increases as k increases. The number of patterns to be generated is positively related to the value of k. Because rank-first-search always gives priority to the generation of high rank patterns, it has a greater chance of obtaining more high rank patterns per unit time. This promotes a fast rise in the threshold, and a rapidly rising threshold greatly reduces the amount of search computation. In contrast, since level-wise-search is limited by the length requirement of the pattern, the generation of high rank patterns is delayed. Moreover, searching a large number of patterns with level-wise-search is quite time-consuming, especially for 2-patterns and 3-patterns. This is one piece of evidence that hybrid-search is more effective than level-wise-search.

To highlight the advantages of hybrid-search, the dynamic rise of the support threshold under the two search methods is recorded in Fig. 8, where the x-axis represents the threshold and the y-axis represents the time at which the corresponding threshold is reached. Compared with level-wise-search, hybrid-search raises the support threshold faster, and the larger the threshold, the larger the gap. In general, the hybrid-search proposed in this paper outperforms the level-wise-search.

Mining performance
This paper evaluates the time efficiency and space efficiency of the GTK algorithm by testing time and memory usage on all eight datasets for various values of k. It should be noted that the timing starting points of the two algorithms are different. BTK starts from calling Candidate_gen function, while GTK starts with getting the 1-pattern subsume indexes. The experimental results show that GTK is more time efficient than BTK when facing different datasets and k, and the mining time of GTK is less affected by the increase of k, while BTK is the opposite. There are many factors contributing to the efficiency of GTK, including the following: The main reason is no other than the fast rank-first-search, and its advantage is most obvious when k is less than 500. This is because when k does not exceed 500, there are not so many patterns to be mined, and most patterns are short patterns with a length less than 4 that can be mined quickly. What is more, using RSL to store frequent patterns saves a lot of time. It can be seen from the Fig. 11, when k is larger, more candidate patterns will be generated. Storing these patterns involves abundant operations, which will cause huge time consumption for BTK. However, RSL of GTK only need to modify the cursor when performing insert and sort operations. Thus, the time consumption is much smaller than GTK. Furthermore, GTK's efficiency also benefits from subsume indexes, PBJ and CC strategies, which avoid a lot of useless joins. To sum up, GTK is not as sensitive to sparseand dense datasets as BTK, and GTK is about 6 times faster than BTK. When facing retail, the time gap between the two algorithms is the largest that GTK's time consumption is only 1/20 of that of BTK. Since the experimental programs are written in the JVM-based SCALA language, the memory usage here refers to the peak usage memory of the JVM during program running. JVM peak memory is dynamic and variable. 
In this paper, the average value over several experiments is taken to reflect the actual memory occupied by the algorithm. As the figures show, for the same dataset and the same k, GTK takes up less memory than BTK. There are three main reasons: (1) Compared with BTK, in which a B-list is used to store the start-build and finish-build information of the patterns, GTK uses a bitset to represent each pattern's transaction set to save space. Especially on large datasets, B-lists with a large amount of B-info-code put a lot of pressure on memory.
(2) As shown in Fig. 8, the threshold rises rapidly under rank-first-search, so the number of candidate patterns is greatly reduced. Rank-first-search is therefore also a very space-efficient method. (3) The GTK algorithm only needs to maintain the array RSL and the sets Ci, which store patterns of the same length, during the mining process, so the space cost is low. The fixed length of RSL is k+1, and when it is full of k Rnodes, each new Rnode only overwrites the tail Rnode instead of allocating new memory. In addition, Ci only needs to store short frequent patterns other than 1-patterns and 2-patterns, which usually does not cause large memory consumption. It is worth noting that BTK occupies up to 2.5 GB of memory on Test990.99KD1 and Test2K50KD1 (Figs. 14 and 16). BTK compresses a large dataset into a single TB-Tree, which is very large and memory-consuming. Therefore, BTK is sensitive to memory pressure. In contrast, GTK adopts a vertical data layout and is not easily affected by such problems.
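The fixed-size behaviour of RSL described in reason (3) can be sketched as a doubly linked list stored in arrays. This is an illustration in Python, not the authors' Scala code. The class name StaticRankList, the use of slot 0 as a sentinel within the k+1 slots, and the field layout are assumptions. Only the properties stated in the text are modelled: the storage never grows, insertion only rewires cursors, and a new node reuses the tail slot once k ranks are stored.

```python
class StaticRankList:
    """Fixed-capacity doubly linked list kept in arrays (illustrative only).

    Assumption: slot 0 is a sentinel; slots 1..k hold nodes that are
    linked in descending order of support.
    """

    def __init__(self, k):
        self.k = k
        self.support = [0] * (k + 1)
        self.patterns = [[] for _ in range(k + 1)]
        self.prev = [0] * (k + 1)   # cursor to the previous slot
        self.next = [0] * (k + 1)   # cursor to the next slot
        self.size = 0

    def threshold(self):
        # Support of the tail node once k ranks are stored, else 0.
        return self.support[self.prev[0]] if self.size == self.k else 0

    def insert(self, pattern, support):
        # Walk from the head to the first node whose support is <= the new one.
        cur = self.next[0]
        while cur != 0 and self.support[cur] > support:
            cur = self.next[cur]
        if cur != 0 and self.support[cur] == support:
            self.patterns[cur].append(pattern)   # same rank: share the node
            return True
        if self.size < self.k:
            self.size += 1
            slot = self.size                     # first unused slot
        else:
            if cur == 0:
                return False                     # below the current threshold
            slot = self.prev[0]                  # reuse the tail slot
            self.next[self.prev[slot]] = 0       # unlink the old tail
            self.prev[0] = self.prev[slot]
            if cur == slot:
                cur = 0                          # new node becomes the tail
        self.support[slot] = support
        self.patterns[slot] = [pattern]
        # Link the slot in front of cur by updating four cursors.
        before = self.prev[cur]
        self.next[before] = slot
        self.prev[slot] = before
        self.next[slot] = cur
        self.prev[cur] = slot
        return True

    def ranked(self):
        out, cur = [], self.next[0]
        while cur != 0:
            out.append((self.support[cur], list(self.patterns[cur])))
            cur = self.next[cur]
        return out


rsl = StaticRankList(k=3)
for pattern, sup in [("a", 5), ("b", 3), ("c", 4), ("ab", 3), ("d", 6)]:
    rsl.insert(pattern, sup)
print(rsl.ranked())      # [(6, ['d']), (5, ['a']), (4, ['c'])]
print(rsl.threshold())   # 4
```

In this example, inserting the pattern with support 6 evicts the rank with support 3 by reusing its slot, so the arrays keep their initial length of k+1 throughout.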

Conclusion and future work
A hybrid mining algorithm for top-rank-k frequent patterns, called GTK, is proposed in this paper. GTK uses a static doubly linked list and a hybrid-search mining method based on a greedy strategy to find frequent patterns. To speed up the process, an optimized generation method for subsume indexes and several pruning strategies are also designed. Experimental results show that the proposed algorithm has better time and space efficiency than BTK. In addition, it was found during the experiments that the mining efficiency can be further improved by appropriately reducing the value of η on some sparse or large datasets. The experimental data also reveal some aspects of GTK that can be improved. For example, when dealing with dense datasets, it may save more time and space to compute the difference of the Tids than to compute their intersection. In recent years, the field of big data has become increasingly popular because of its wide range of applications. Therefore, the parallel mining of top-rank-k frequent patterns will be studied in future work.
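The remark about computing the difference of the Tids on dense datasets can be illustrated with a minimal Python sketch. It follows the general diffset idea, in which the support of a joined pattern is the support of one parent minus the size of the Tid difference. It is not part of GTK, and the function names and sample Tid sets are hypothetical.

```python
def join_by_intersection(tids_x, tids_y):
    """Return the Tid set of the joined pattern and its support."""
    tids_xy = tids_x & tids_y
    return tids_xy, len(tids_xy)

def join_by_difference(tids_x, tids_y):
    """Return the Tids of x that lack y, and the support of the joined pattern.

    support(xy) = support(x) - |tids(x) - tids(y)|
    """
    diff_xy = tids_x - tids_y
    return diff_xy, len(tids_x) - len(diff_xy)

# Hypothetical dense data: both patterns occur in 7 of 8 transactions.
tids_x = frozenset({0, 1, 2, 3, 4, 5, 6})
tids_y = frozenset({0, 1, 2, 3, 4, 5, 7})

tids_xy, support_a = join_by_intersection(tids_x, tids_y)
diff_xy, support_b = join_by_difference(tids_x, tids_y)
print(support_a, len(tids_xy))   # 6 6
print(support_b, len(diff_xy))   # 6 1
```

Both joins give the same support, but on this dense example the difference holds one Tid while the intersection holds six, which is the kind of saving the remark refers to. On sparse data the difference can instead be larger than the intersection.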