Multilingual spectral clustering using document similarity propagation

Authors:
Dani Yogatama, University of Tokyo, Chiyoda-ku, Tokyo, Japan
Kumiko Tanaka-Ishii, University of Tokyo, Chiyoda-ku, Tokyo, Japan
EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, August 2009, Pages 871–879

Published: 06 August 2009

ABSTRACT

We present a novel approach to multilingual document clustering that uses only comparable corpora to achieve cross-lingual semantic interoperability. The method models document collections as a weighted graph, and supervisory information is given as sets of must-link constraints between documents in different languages. Recursive k-nearest-neighbor similarity propagation exploits this prior knowledge to merge the two language spaces. A spectral method is then applied to find the best cuts of the graph. Experimental results show that, with limited supervisory information, our method achieves promising clustering results. Furthermore, since the method requires no language-dependent information, it can be applied to languages with different writing systems.
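
The abstract does not spell out the propagation rule or the exact spectral algorithm, so the following is only a minimal sketch of the general pipeline under stated assumptions, not the authors' implementation. It assumes cosine similarity within each language, a single (non-recursive) propagation step in which a must-link also connects the k nearest neighbors of the two linked documents, and a two-way cut taken from the sign of the second eigenvector of the normalized Laplacian. The function names `cosine_sim`, `propagate`, and `spectral_bipartition`, and the toy term-count matrices, are hypothetical.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def propagate(S_a, S_b, must_links, k=1):
    """Assumed one-step propagation: a must-link (i, j) also links the
    k nearest neighbors of i (language A) to those of j (language B),
    weighted by the product of the within-language similarities."""
    C = np.zeros((len(S_a), len(S_b)))
    for i, j in must_links:
        C[i, j] = 1.0
        nn_a = [p for p in np.argsort(-S_a[i]) if p != i][:k]
        nn_b = [q for q in np.argsort(-S_b[j]) if q != j][:k]
        for p in [i] + nn_a:
            for q in [j] + nn_b:
                C[p, q] = max(C[p, q], S_a[i, p] * S_b[j, q])
    return C

def spectral_bipartition(W):
    """Two-way cut from the second eigenvector of the normalized Laplacian."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = d_inv_sqrt @ vecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy term-count matrices; the two languages have unrelated vocabularies.
docs_a = np.array([[2, 1, 0, 0, 1],
                   [1, 2, 0, 0, 1],
                   [0, 0, 2, 1, 1],
                   [0, 0, 1, 2, 1]], dtype=float)
docs_b = np.array([[3, 0, 1],
                   [2, 0, 1],
                   [0, 3, 1],
                   [0, 2, 1]], dtype=float)

S_a, S_b = cosine_sim(docs_a), cosine_sim(docs_b)

# Supervision: document 0 in A matches document 0 in B, and 2 matches 2.
C = propagate(S_a, S_b, must_links=[(0, 0), (2, 2)], k=1)

# Merge the two language spaces into one weighted graph and cut it.
W = np.block([[S_a, C], [C.T, S_b]])
labels = spectral_bipartition(W)
print(labels)  # first four entries: language A, last four: language B
```

Only the two must-link pairs carry cross-lingual information here. The cross-language block `C` is otherwise built from within-language similarities, which is why no dictionary or other language-dependent resource is needed in this sketch.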

Index Terms

Multilingual spectral clustering using document similarity propagation
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
August 2009
616 pages
ISBN: 9781932432626
Program Chairs: Philipp Koehn (University of Edinburgh), Rada Mihalcea (University of North Texas)
Publisher: Association for Computational Linguistics, United States
Publication History: Published 6 August 2009
Qualifiers: research-article
Overall Acceptance Rate: 73 of 234 submissions, 31%

Article Metrics
- Total Citations: 3
- Total Downloads: 270
- Downloads (Last 12 months): 37
- Downloads (Last 6 weeks): 1
