Abstract
Due to the inherent flexibilities in both structure and semantics, XML association rule mining faces several challenges, such as a more complicated hierarchical data structure and an ordered data context. Mining frequent patterns from XML documents can be recast as mining frequent tree structures from a database of XML documents. In this study, we model a database of XML documents as a database of rooted labeled ordered subtrees. In particular, we are mainly concerned with mining frequent induced and embedded ordered subtrees. Our main contributions are as follows. We describe our unique embedding list representation of the tree structure, which enables efficient implementation of our Tree Model Guided (TMG) candidate generation. TMG is an optimal, nonredundant enumeration strategy that enumerates all the valid candidates that conform to the structural aspects of the data. We show through a mathematical model and experiments that TMG has better complexity than the commonly used join approach. In this article, we propose two algorithms, MB3-Miner and iMB3-Miner. MB3-Miner mines embedded subtrees. iMB3-Miner mines induced and/or embedded subtrees by using the maximum level of embedding constraint. Our experiments with both synthetic and real datasets, run against two well-known algorithms for mining induced and embedded subtrees, demonstrate the effectiveness and the efficiency of the proposed techniques.
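To make the distinction between induced and embedded subtrees concrete, the following is a minimal Python sketch, not the MB3-Miner or iMB3-Miner implementation and not the embedding list representation. It assumes that the level of embedding between an ancestor and a descendant is the length of the path between them, so that a maximum level of 1 keeps only parent-child (induced) relationships. The function names `preorder` and `two_node_candidates` and the sample document are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def preorder(xml_text):
    """Model an XML document as a rooted labeled ordered tree.

    Returns the nodes in pre-order as (label, depth, parent_index) tuples;
    the root has parent_index -1.
    """
    root = ET.fromstring(xml_text)
    nodes = []

    def visit(elem, depth, parent):
        index = len(nodes)
        nodes.append((elem.tag, depth, parent))
        for child in elem:  # iteration follows document (sibling) order
            visit(child, depth + 1, index)

    visit(root, 0, -1)
    return nodes


def two_node_candidates(nodes, max_level=None):
    """List (ancestor_label, descendant_label, level) for 2-node subtrees.

    `level` is the assumed level of embedding: the path length between the
    two nodes. max_level=1 keeps only induced (parent-child) pairs;
    max_level=None places no bound, so all embedded pairs are kept.
    """
    pairs = []
    for label, _depth, parent in nodes:
        ancestor, level = parent, 1
        while ancestor != -1 and (max_level is None or level <= max_level):
            pairs.append((nodes[ancestor][0], label, level))
            ancestor = nodes[ancestor][2]  # climb one step toward the root
            level += 1
    return pairs


doc = "<a><b><c/></b><d/></a>"
nodes = preorder(doc)
induced = two_node_candidates(nodes, max_level=1)
embedded = two_node_candidates(nodes)
```

Under these assumptions, the pair (`a`, `c`) appears only in `embedded`, with level 2, because `c` is a grandchild of `a`; tightening `max_level` to 1 removes it and leaves the induced pairs only.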