Tree model guided candidate generation for mining frequent subtrees from XML documents

Authors:
Henry Tan, Univ. of Technology Sydney, Australia
Fedja Hadzic, Univ. of Technology Sydney, Australia
Tharam S. Dillon, Univ. of Technology Sydney, Australia
Elizabeth Chang, Curtin Univ. of Technology, Perth, Australia
Ling Feng, Tsinghua Univ., China

ACM Transactions on Knowledge Discovery from Data, Volume 2, Issue 2, Article No. 9, pp. 1–43. https://doi.org/10.1145/1376815.1376818

Published: 24 July 2008

Abstract

Due to the inherent flexibilities in both structure and semantics, XML association rule mining faces a few challenges, such as a more complicated hierarchical data structure and an ordered data context. Mining frequent patterns from XML documents can be recast as mining frequent tree structures from a database of XML documents. In this study, we model a database of XML documents as a database of rooted labeled ordered subtrees. In particular, we are mainly concerned with mining frequent induced and embedded ordered subtrees. Our main contributions are as follows. We describe our unique embedding list representation of the tree structure, which enables efficient implementation of our Tree Model Guided (TMG) candidate generation. TMG is an optimal, nonredundant enumeration strategy that enumerates all the valid candidates that conform to the structural aspects of the data. We show through a mathematical model and experiments that TMG has better complexity than the commonly used join approach. In this article, we propose two algorithms, MB3-Miner and iMB3-Miner. MB3-Miner mines embedded subtrees. iMB3-Miner mines induced and/or embedded subtrees by using the maximum level of embedding constraint. Our experiments with both synthetic and real datasets, against two well-known algorithms for mining induced and embedded subtrees, demonstrate the effectiveness and the efficiency of the proposed techniques.
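The distinction between induced and embedded subtrees, and the role of a maximum level of embedding, can be illustrated with a small sketch. This is not the authors' MB3-Miner or iMB3-Miner, and it does not use the embedding list representation; it is a toy example under stated assumptions. The tree encoding (pre-order list of `(label, parent_index)` pairs), the function names, and the choice to count each document at most once toward support are all hypothetical. The sketch enumerates only two-node patterns (ancestor, descendant) that actually occur in the data, and treats a pattern as induced when the depth difference is 1.

```python
from collections import Counter

# Hypothetical toy encoding: a tree is a list of (label, parent_index) pairs
# in pre-order, and the root has parent_index -1.

def depths(tree):
    """Return the depth of every node; parents precede children in pre-order."""
    d = []
    for _label, parent in tree:
        d.append(0 if parent < 0 else d[parent] + 1)
    return d

def two_node_candidates(tree, max_level):
    """Yield (ancestor_label, descendant_label) for each ancestor-descendant
    pair whose depth difference is at most max_level.
    max_level == 1 keeps only parent-child (induced) pairs."""
    d = depths(tree)
    for i, (label, parent) in enumerate(tree):
        a = parent
        while a >= 0 and d[i] - d[a] <= max_level:
            yield (tree[a][0], label)
            a = tree[a][1]  # climb to the next ancestor

def frequent_pairs(database, min_support, max_level):
    """Count each pattern once per tree (an assumed support definition)
    and keep the patterns that reach min_support."""
    support = Counter()
    for tree in database:
        support.update(set(two_node_candidates(tree, max_level)))
    return {p: s for p, s in support.items() if s >= min_support}

# Two small documents: a(b(c), d) and a(c, d).
doc1 = [("a", -1), ("b", 0), ("c", 1), ("d", 0)]
doc2 = [("a", -1), ("c", 0), ("d", 0)]
database = [doc1, doc2]

induced = frequent_pairs(database, min_support=2, max_level=1)
embedded = frequent_pairs(database, min_support=2, max_level=2)
print(induced)
print(embedded)
```

With `max_level=1`, the pair (a, c) is frequent only in the second document, because in the first document `c` is a grandchild of `a`. Raising `max_level` to 2 lets the first document contribute the same pair as an embedded occurrence, so it becomes frequent in both.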

Index Terms

1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in

ACM Transactions on Knowledge Discovery from Data Volume 2, Issue 2
July 2008
152 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1376815

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 July 2008
- Accepted: 1 April 2008
- Revised: 1 November 2007
- Received: 1 March 2007

Author Tags
FREQT
TMG
Tree mining
TreeMiner
tree model guided
Qualifiers
- research-article
- Research
- Refereed