DOI: 10.5555/1699571.1699626

Multilingual spectral clustering using document similarity propagation

Published: 6 August 2009

ABSTRACT

We present a novel approach to multilingual document clustering that uses only comparable corpora to achieve cross-lingual semantic interoperability. The method models document collections as a weighted graph, and supervisory information is given as sets of must-link constraints between documents in different languages. Recursive k-nearest neighbor similarity propagation is used to exploit this prior knowledge and merge the two language spaces. A spectral method is then applied to find the best cuts of the graph. Experimental results show that, with limited supervisory information, our method achieves promising clustering results. Furthermore, since the method needs no language-dependent information, the algorithm can be applied to languages with different writing systems.
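The pipeline outlined in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the details of the recursive propagation, so the code below assumes a simplified one-step variant in which each must-link pair and its k nearest same-language neighbors receive cross-language similarity equal to the product of their within-language similarities. The two-way cut is taken from the sign pattern of the second eigenvector of a lazy random walk on the graph, found by power iteration. The eight documents, their similarity values, and the names `propagate_must_links` and `spectral_bipartition` are all hypothetical.

```python
# Toy sketch only: simplified one-step k-NN propagation plus a two-way spectral cut.
# All data and function names are illustrative assumptions, not taken from the paper.

def propagate_must_links(W, must_links, lang, k=1):
    """Return a copy of W with must-link similarity spread to same-language neighbors."""
    n = len(W)
    W = [row[:] for row in W]

    def neighbors(i):
        # The k most similar documents written in the same language as document i.
        same = [j for j in range(n) if j != i and lang[j] == lang[i]]
        return sorted(same, key=lambda j: -W[i][j])[:k]

    for a, b in must_links:
        # Each side: the linked document itself (weight 1) plus its k nearest neighbors.
        side_a = [(a, 1.0)] + [(j, W[a][j]) for j in neighbors(a)]
        side_b = [(b, 1.0)] + [(j, W[b][j]) for j in neighbors(b)]
        for i, wi in side_a:
            for j, wj in side_b:
                W[i][j] = W[j][i] = max(W[i][j], wi * wj)
    return W


def spectral_bipartition(W, iters=200):
    """Split the graph in two by the sign of the second eigenvector of a lazy random walk.

    Returns a list of booleans: True for documents on the same side as document 0.
    Assumes W is symmetric and every document has nonzero degree.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    total = sum(deg)
    v = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Project out the trivial constant eigenvector (degree-weighted mean).
        mean = sum(d * x for d, x in zip(deg, v)) / total
        v = [x - mean for x in v]
        # One step of the lazy walk (I + D^-1 W) / 2, whose eigenvalues lie in [0, 1].
        v = [0.5 * (v[i] + sum(W[i][j] * v[j] for j in range(n)) / deg[i])
             for i in range(n)]
        scale = max(abs(x) for x in v)
        v = [x / scale for x in v]
    return [(x > 0) == (v[0] > 0) for x in v]


# Hypothetical corpus: documents 0-3 in language A, 4-7 in language B.
lang = ["A"] * 4 + ["B"] * 4
topic = [0, 0, 1, 1, 0, 0, 1, 1]  # used only to build the toy similarities


def toy_similarity(i, j):
    # No shared vocabulary across languages, so cross-language similarity starts at 0.
    if i == j or lang[i] != lang[j]:
        return 0.0
    return 0.9 if topic[i] == topic[j] else 0.1


W = [[toy_similarity(i, j) for j in range(8)] for i in range(8)]
must_links = [(0, 4), (2, 6)]  # one cross-language pair per topic
W_linked = propagate_must_links(W, must_links, lang, k=1)

print("without constraints:", spectral_bipartition(W))
print("with constraints:   ", spectral_bipartition(W_linked))
```

On this toy data, the cut without constraints separates the two languages, because the two language subgraphs are disconnected. With two must-link pairs propagated to one neighbor each, the cheapest cut instead separates the two topics across both languages.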


Published in

EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
August 2009
616 pages
ISBN: 9781932432626

Publisher

Association for Computational Linguistics, United States


Qualifiers

research-article

Acceptance Rates

Overall acceptance rate: 73 of 234 submissions, 31%
