A scalable algorithm for mining maximal frequent sequences using a sample

Luo, Congnan; Chung, Soon M.

doi:10.1007/s10115-006-0056-0

Regular Paper
Published: 24 January 2007

Volume 15, pages 149–179, (2008)

Knowledge and Information Systems

Congnan Luo¹ &
Soon M. Chung¹

Abstract

In this paper, we propose an efficient and scalable algorithm for mining Maximal Sequential Patterns using Sampling (MSPS). MSPS prunes much more of the search space than other algorithms because it applies both subsequence infrequency-based pruning and supersequence frequency-based pruning. In MSPS, a sampling technique identifies long frequent sequences early, so their subsequences need not all be enumerated. We propose a method for adjusting the user-specified minimum support level when mining a sample of the database to achieve better overall performance; this adjustment makes sampling more efficient when the minimum support is small. A signature-based method and a hash-based method are developed for subsequence infrequency-based pruning when the seed set of frequent sequences used for candidate generation is too large to be loaded into memory. A prefix tree structure is developed to count candidate sequences of different sizes during database scanning, and it also facilitates customer sequence trimming. Our experiments show that MSPS performs very well and scales better than other algorithms.
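The two pruning rules named in the abstract can be illustrated with a minimal Python sketch. It is not the MSPS implementation: the sequence representation (a tuple of itemsets) and the helper names `seq`, `is_subsequence`, `support`, `needs_counting`, and `maximal_only` are assumptions made for illustration, and the signature-based, hash-based, sampling, and prefix tree components are not modeled. The sketch relies only on the standard facts that every subsequence of a frequent sequence is frequent and every supersequence of an infrequent sequence is infrequent.

```python
from typing import FrozenSet, List, Sequence as Seq, Tuple

# Assumed representation: a sequence is an ordered tuple of itemsets.
Sequence = Tuple[FrozenSet[str], ...]


def seq(*itemsets: str) -> Sequence:
    """Hypothetical helper: seq("a", "bc") builds the sequence <(a)(b c)>."""
    return tuple(frozenset(s) for s in itemsets)


def is_subsequence(small: Sequence, big: Sequence) -> bool:
    """True if each itemset of `small` is contained, in order, in a
    distinct itemset of `big` (greedy earliest match)."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1  # the next itemset must match a later position
    return True


def support(candidate: Sequence, database: Seq[Sequence]) -> float:
    """Fraction of customer sequences that contain `candidate`."""
    hits = sum(1 for customer in database if is_subsequence(candidate, customer))
    return hits / len(database)


def needs_counting(candidate: Sequence,
                   known_frequent: Seq[Sequence],
                   known_infrequent: Seq[Sequence]) -> bool:
    """Decide whether a candidate still has to be counted against the database."""
    # Supersequence frequency-based pruning: a subsequence of a known
    # frequent sequence is itself frequent, so counting is unnecessary.
    if any(is_subsequence(candidate, f) for f in known_frequent):
        return False
    # Subsequence infrequency-based pruning: a candidate containing a known
    # infrequent sequence cannot be frequent, so it is discarded.
    if any(is_subsequence(i, candidate) for i in known_infrequent):
        return False
    return True


def maximal_only(frequent: Seq[Sequence]) -> List[Sequence]:
    """Keep only sequences that are not contained in another frequent sequence."""
    return [s for s in frequent
            if not any(s != t and is_subsequence(s, t) for t in frequent)]


# Toy database of four customer sequences (illustrative data only).
database = [seq("a", "b", "c"), seq("a", "bc"), seq("a", "c"), seq("b")]
```

In this sketch, knowing one long frequent sequence lets `needs_counting` skip all of its subsequences, which is the motivation the abstract gives for finding long frequent sequences early from a sample.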

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Wright State University, Dayton, OH, 45435, USA
Congnan Luo & Soon M. Chung

Corresponding author

Correspondence to Soon M. Chung.

Additional information

Congnan Luo received the B.E. degree in Computer Science from Tsinghua University, Beijing, P.R. China, in 1997, the M.S. degree in Computer Science from the Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China, in 2000, and the Ph.D. degree in Computer Science and Engineering from Wright State University, Dayton, OH, in 2006. He is currently a member of the technical staff at the Teradata division of NCR in San Diego, CA, and his research interests include data mining, machine learning, and databases.

Soon M. Chung received the B.S. degree in Electronic Engineering from Seoul National University, Korea, in 1979, the M.S. degree in Electrical Engineering from Korea Advanced Institute of Science and Technology, Korea, in 1981, and the Ph.D. degree in Computer Engineering from Syracuse University, Syracuse, New York, in 1990. He is currently a Professor in the Department of Computer Science and Engineering at Wright State University, Dayton, OH. His research interests include databases, data mining, Grid computing, text mining, XML, and parallel and distributed processing.

About this article

Cite this article

Luo, C., Chung, S.M. A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst 15, 149–179 (2008). https://doi.org/10.1007/s10115-006-0056-0

Received: 14 April 2005
Revised: 15 September 2006
Accepted: 07 October 2006
Published: 24 January 2007
Issue Date: May 2008
DOI: https://doi.org/10.1007/s10115-006-0056-0
