Utilizing the Structure and Content Information for XML Document Clustering

Tran, Tien; Kutty, Sangeetha; Nayak, Richi

doi:10.1007/978-3-642-03761-0_48


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5631)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

This paper reports the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The approach uses both the structure and the content of the Wikipedia XML document collection. A latent semantic kernel (LSK) measures the semantic similarity between XML documents based on their content features. Constructing a latent semantic kernel requires computing a singular value decomposition (SVD). On a large feature space matrix, computing the SVD is very expensive in both time and memory. In this approach, therefore, the dimension of the document space of the term-document matrix is reduced before the SVD is performed. The document space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.
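
The pipeline summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how documents are selected using structural information, so the helper `sample_by_structure`, the group labels, the toy matrix, and the cosine normalisation in `lsk_similarity` are all assumptions made for illustration. The sketch only shows the general idea: reduce the number of document columns, run the SVD on the smaller matrix, and use the top-k left singular vectors as a kernel over the term space.

```python
import numpy as np


def sample_by_structure(term_doc, structure_labels, per_group):
    """Hypothetical document-space reduction: keep at most `per_group`
    documents (columns) from each structural group."""
    keep = []
    for label in sorted(set(structure_labels)):
        cols = [i for i, lab in enumerate(structure_labels) if lab == label]
        keep.extend(cols[:per_group])
    return term_doc[:, keep]


def build_latent_semantic_kernel(term_doc, k):
    """Return U_k, the top-k left singular vectors of a terms x docs matrix."""
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]


def lsk_similarity(d1, d2, U_k):
    """Cosine similarity of two term vectors after projection onto U_k
    (the normalisation is an assumption, not taken from the abstract)."""
    p1 = U_k.T @ d1
    p2 = U_k.T @ d2
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(p1 @ p2 / denom) if denom > 0 else 0.0


# Toy terms x documents matrix: 4 terms (rows), 6 documents (columns).
# Documents 0-2 use terms 0 and 1; documents 3-5 use terms 2 and 3.
docs = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
])
# Hypothetical structural groups, one label per document.
labels = ["a", "a", "a", "b", "b", "b"]

# Reduce the document space first, then run the SVD on the smaller matrix.
reduced = sample_by_structure(docs, labels, per_group=2)
U_k = build_latent_semantic_kernel(reduced, k=2)

# Documents 1 and 2 share no terms, yet are related through document 0.
same_topic = lsk_similarity(docs[:, 1], docs[:, 2], U_k)
cross_topic = lsk_similarity(docs[:, 1], docs[:, 4], U_k)
print(same_topic, cross_topic)
```

Because the kernel is defined over terms rather than documents, documents left out of the reduced matrix (columns 2 and 5 here) can still be compared with it. That is what makes shrinking the document space before the SVD workable in this sketch.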

Author information

Authors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, Brisbane, Qld, 4001, Australia
Tien Tran, Sangeetha Kutty & Richi Nayak

Editor information

Editors and Affiliations

Faculty of Science and Technology, Queensland University of Technology, GPO Box 2434, Brisbane, Qld, 4001, Australia
Shlomo Geva
Archives and Information Studies/Humanities, University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Department of Computer Science, University of Otago, P.O. Box 56, 9054, Dunedin, New Zealand
Andrew Trotman

Cite this paper

Tran, T., Kutty, S., Nayak, R. (2009). Utilizing the Structure and Content Information for XML Document Clustering. In: Geva, S., Kamps, J., Trotman, A. (eds) Advances in Focused Retrieval. INEX 2008. Lecture Notes in Computer Science, vol 5631. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03761-0_48

DOI: https://doi.org/10.1007/978-3-642-03761-0_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03760-3
Online ISBN: 978-3-642-03761-0
eBook Packages: Computer Science, Computer Science (R0)