research-article

Tree model guided candidate generation for mining frequent subtrees from XML documents

Published: 24 July 2008

Abstract

Due to the inherent flexibilities in both structure and semantics, XML association rule mining faces a few challenges, such as a more complicated hierarchical data structure and an ordered data context. Mining frequent patterns from XML documents can be recast as mining frequent tree structures from a database of XML documents. In this study, we model a database of XML documents as a database of rooted labeled ordered subtrees. In particular, we are mainly concerned with mining frequent induced and embedded ordered subtrees. Our main contributions are as follows. We describe our unique embedding list representation of the tree structure, which enables efficient implementation of our Tree Model Guided (TMG) candidate generation. TMG is an optimal, nonredundant enumeration strategy that enumerates all the valid candidates that conform to the structural aspects of the data. We show through a mathematical model and experiments that TMG has better complexity than the commonly used join approach. In this article, we propose two algorithms, MB3-Miner and iMB3-Miner. MB3-Miner mines embedded subtrees. iMB3-Miner mines induced and/or embedded subtrees by using the maximum level of embedding constraint. Our experiments with both synthetic and real datasets, run against two well-known algorithms for mining induced and embedded subtrees, demonstrate the effectiveness and efficiency of the proposed techniques.
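To make the distinction between induced and embedded subtrees concrete, the following is a minimal sketch, not the MB3-Miner or iMB3-Miner implementation. It is restricted to two-node patterns and assumes that the level of embedding between an ancestor and a descendant is the number of edges on the path between them, so a maximum level of 1 yields induced (parent-child) patterns and larger values admit embedded (ancestor-descendant) patterns. The tree encoding and the function names `ancestor_descendant_pairs` and `frequent_pairs` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumed encoding: a tree is a list of (label, parent_index) pairs in
# pre-order; the root has parent index -1.

def ancestor_descendant_pairs(tree, max_level):
    """Yield (ancestor_label, descendant_label) for every node pair whose
    connecting path has between 1 and max_level edges."""
    for label, parent in tree:
        level = 1
        anc = parent
        while anc != -1 and level <= max_level:
            yield (tree[anc][0], label)
            anc = tree[anc][1]  # climb one step towards the root
            level += 1

def frequent_pairs(database, min_support, max_level):
    """Count each two-node pattern at most once per tree and keep those
    that occur in at least min_support trees."""
    support = Counter()
    for tree in database:
        support.update(set(ancestor_descendant_pairs(tree, max_level)))
    return {p: s for p, s in support.items() if s >= min_support}

# Two small documents: a(b(c)) and a(c).
db = [
    [("a", -1), ("b", 0), ("c", 1)],
    [("a", -1), ("c", 0)],
]
induced = frequent_pairs(db, min_support=2, max_level=1)
embedded = frequent_pairs(db, min_support=2, max_level=2)
```

With a maximum level of 1, no pattern reaches support 2, because `a` is the parent of `c` in only one document. Raising the maximum level to 2 makes the pattern `(a, c)` frequent, since `a` is an ancestor of `c` in both. The sketch only ever produces patterns that actually occur in the data, which loosely echoes the idea of generating candidates that conform to the structure of the documents.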


            Published in

            ACM Transactions on Knowledge Discovery from Data, Volume 2, Issue 2
            July 2008
            152 pages
            ISSN: 1556-4681
            EISSN: 1556-472X
            DOI: 10.1145/1376815

            Copyright © 2008 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 24 July 2008
            • Accepted: 1 April 2008
            • Revised: 1 November 2007
            • Received: 1 March 2007

            Qualifiers

            • research-article
            • Research
            • Refereed
