ABSTRACT
XML documents have recently become ubiquitous because they are used in a wide range of applications. Classification is an important problem in data mining, but current classification methods for XML documents rely on IR-based techniques that treat each document as a bag of words. Such techniques ignore a significant amount of structural information hidden inside the documents. In this paper we discuss the problem of rule-based classification of XML data using frequent discriminatory substructures within XML documents. Such a technique is better able to capture the classification characteristics of documents. In addition, the technique can be extended to cost-sensitive classification. We demonstrate the effectiveness of the method relative to other classifiers. We note that the methodology discussed in this paper is applicable to any kind of semi-structured data.
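The contrast between a bag-of-words view and a structural view can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not say how substructures are mined, so root-to-node tag paths are used here as an assumed, simplified stand-in for substructures, and the helper names `bag_of_words` and `tag_paths` and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def bag_of_words(xml_text):
    """Count tags and text tokens, discarding all nesting information."""
    root = ET.fromstring(xml_text)
    words = Counter()
    for node in root.iter():
        words[node.tag] += 1
        if node.text and node.text.strip():
            words.update(node.text.split())
    return words


def tag_paths(xml_text):
    """Collect root-to-node tag paths (an assumed, simplified structural feature)."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


# Two hypothetical documents with the same tags and words but different nesting.
doc_a = "<order><customer><name>Ann</name></customer><item>pen</item></order>"
doc_b = "<order><item><name>Ann</name></item><customer>pen</customer></order>"

# The bag-of-words view cannot tell the two documents apart.
same_words = bag_of_words(doc_a) == bag_of_words(doc_b)

# The structural view can: each document has a path the other lacks.
only_in_a = tag_paths(doc_a) - tag_paths(doc_b)
only_in_b = tag_paths(doc_b) - tag_paths(doc_a)

print(same_words, sorted(only_in_a), sorted(only_in_b))
```

In a rule-based setting, a path that occurs frequently in one class and rarely in the others could serve as the antecedent of a classification rule; how such structures are selected and ranked is left to the paper itself.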