Abstract
This paper reports the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The approach uses both the structure and the content of the Wikipedia XML document collection. A latent semantic kernel (LSK) measures the semantic similarity between XML documents based on their content features. Constructing a latent semantic kernel requires computing a singular value decomposition (SVD), which on a large feature-space matrix is very expensive in both time and memory. The approach therefore reduces the document space of the term-document matrix before performing SVD, using the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.
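The idea of shrinking the document space before SVD can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy matrix, the structural group labels, and the rule of keeping the first few documents per structural group are all assumptions made for illustration. The kernel is taken in its usual form, the projection onto the top-k left singular vectors of a terms-by-documents matrix, with a cosine-style normalization.

```python
import numpy as np

def reduce_document_space(term_doc, structure_labels, per_group):
    """Keep at most `per_group` document columns from each structural group.

    Hypothetical selection rule: the abstract only says the reduction is
    based on common structural information, not how documents are picked.
    """
    keep = []
    for label in sorted(set(structure_labels)):
        members = [j for j, s in enumerate(structure_labels) if s == label]
        keep.extend(members[:per_group])
    return term_doc[:, keep]

def build_latent_semantic_kernel(term_doc, k):
    """Return P = U_k U_k^T from the SVD of a terms x documents matrix."""
    u, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    u_k = u[:, :k]  # top-k left singular vectors (term space)
    return u_k @ u_k.T

def lsk_similarity(p, d1, d2):
    """Cosine-style similarity of two term vectors under the kernel P."""
    num = d1 @ p @ d2
    den = np.sqrt(d1 @ p @ d1) * np.sqrt(d2 @ p @ d2)
    return float(num / den) if den > 0 else 0.0

# Toy data (assumed): 4 terms x 6 documents, two structural groups.
term_doc = np.array([
    [2, 1, 0, 0, 1, 0],
    [1, 2, 0, 0, 0, 1],
    [0, 0, 3, 1, 1, 0],
    [0, 0, 1, 2, 0, 1],
], dtype=float)
structure_labels = ["a", "a", "b", "b", "a", "b"]

# SVD runs on the reduced matrix only; the kernel still lives in term
# space, so it can score any document in the full collection.
reduced = reduce_document_space(term_doc, structure_labels, per_group=2)
kernel = build_latent_semantic_kernel(reduced, k=2)
same_topic = lsk_similarity(kernel, term_doc[:, 0], term_doc[:, 1])
cross_topic = lsk_similarity(kernel, term_doc[:, 0], term_doc[:, 2])
```

Because the kernel matrix is indexed by terms rather than documents, dropping document columns lowers the cost of the SVD without changing the shape of the kernel that is later applied to the whole collection.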
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tran, T., Kutty, S., Nayak, R. (2009). Utilizing the Structure and Content Information for XML Document Clustering. In: Geva, S., Kamps, J., Trotman, A. (eds) Advances in Focused Retrieval. INEX 2008. Lecture Notes in Computer Science, vol 5631. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03761-0_48
DOI: https://doi.org/10.1007/978-3-642-03761-0_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03760-3
Online ISBN: 978-3-642-03761-0
eBook Packages: Computer Science, Computer Science (R0)