ABSTRACT
We present a geometric view of bilingual lexicon extraction from comparable corpora, which allows us to reinterpret the methods proposed so far and to identify unresolved problems. This view motivates three new methods that aim to solve these problems. Empirical evaluation shows the strengths and weaknesses of these methods, as well as a significant gain in the accuracy of the extracted lexicons.