DOI: 10.3115/1218955.1219022

A geometric view on bilingual lexicon extraction from comparable corpora

Published: 21 July 2004

ABSTRACT

We present a geometric view on bilingual lexicon extraction from comparable corpora, which allows us to re-interpret the methods proposed so far and to identify unresolved problems. This motivates three new methods that aim to solve these problems. Empirical evaluation shows the strengths and weaknesses of these methods, as well as a significant gain in the accuracy of the extracted lexicons.

Published in

ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
July 2004, 729 pages

Publisher: Association for Computational Linguistics, United States

Qualifiers: Article

Overall Acceptance Rate: 85 of 443 submissions, 19%
