Abstract
The increasing prominence of data streams in a wide range of advanced applications, such as fraud detection and trend learning, has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement, and the high arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that can match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of state-of-the-art algorithms for mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt, in order to provide insights into how and why the techniques are useful. We then further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations of existing research and substantive areas for future research.
Cite this article
Cheng, J., Ke, Y. & Ng, W. A survey on algorithms for mining frequent itemsets over data streams. Knowl Inf Syst 16, 1–27 (2008). https://doi.org/10.1007/s10115-007-0092-4
DOI: https://doi.org/10.1007/s10115-007-0092-4