ABSTRACT
We present a novel approach to multilingual document clustering that uses only comparable corpora to achieve cross-lingual semantic interoperability. The method models document collections as a weighted graph, and supervisory information is given as sets of must-link constraints between documents in different languages. Recursive k-nearest-neighbor similarity propagation is used to exploit this prior knowledge and merge the two language spaces. A spectral method is then applied to find the best cuts of the graph. Experimental results show that, with limited supervisory information, our method achieves promising clustering results. Furthermore, since the method does not need any language-dependent information, it can be applied to languages written in different writing systems.
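The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the propagation rule, so the `propagate` function below is an assumed stand-in in which a cross-language pair receives similarity when both documents are among the k nearest neighbors of a must-linked pair. The function names, the toy term vectors, and the choice of a two-way cut via the Fiedler vector of the normalized Laplacian are likewise assumptions made for illustration.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)
    return Xn @ Xn.T

def knn_weights(S, k):
    """Keep, for each row, only its k largest similarities (self included)."""
    n = S.shape[0]
    idx = np.argsort(-S, axis=1)[:, :k]
    mask = np.zeros(S.shape, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    return np.where(mask, S, 0.0)

def propagate(S_a, S_b, must_links, k=2, rounds=1):
    """Assumed propagation rule (hypothetical, not taken from the paper):
    spread must-link similarity to the k nearest neighbors on both sides."""
    C = np.zeros((S_a.shape[0], S_b.shape[0]))
    for i, j in must_links:
        C[i, j] = 1.0
    N_a, N_b = knn_weights(S_a, k), knn_weights(S_b, k)
    for _ in range(rounds):
        # Keep existing links; cap propagated similarity at 1.
        C = np.maximum(C, np.minimum(N_a @ C @ N_b.T, 1.0))
    return C

def spectral_bipartition(W):
    """Two-way cut from the sign of the Fiedler vector of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))  # assumes every node has positive degree
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    fiedler = vecs[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)

# Toy data: two languages with disjoint vocabularies, two topics each.
# Rows 0-1 are topic X and rows 2-3 are topic Y in both languages.
docs_a = [[3, 2, 0, 0.3], [2, 3, 0.3, 0], [0, 0.3, 3, 2], [0.3, 0, 2, 3]]
docs_b = [[3, 2, 0, 0.3], [2, 3, 0.3, 0], [0, 0.3, 3, 2], [0.3, 0, 2, 3]]
must_links = [(0, 0), (2, 2)]  # one supervised cross-language pair per topic

S_a, S_b = cosine_sim(docs_a), cosine_sim(docs_b)
C = propagate(S_a, S_b, must_links, k=2)

# Merge both language spaces into one weighted graph.
W = np.block([[S_a, C], [C.T, S_b]])
np.fill_diagonal(W, 0.0)

labels = spectral_bipartition(W)
print(labels)  # first four entries: language A, last four: language B
```

In this toy setting the only cross-language edges come from the two must-link pairs and their propagated neighbors, which is what lets the spectral cut group documents by topic rather than by language. A k-way clustering would use more eigenvectors followed by a step such as k-means, which is omitted here.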
Index Terms
- Multilingual spectral clustering using document similarity propagation