Trainable Coarse Bilingual Grammars for Parallel Text Bracketing

Part of the book series: Text, Speech and Language Technology (TLTB, volume 11)

Abstract

We describe two new strategies for the automatic bracketing of parallel corpora, with particular application to languages where prior grammar resources are scarce: (1) coarse bilingual grammars, and (2) unsupervised training of such grammars via EM (expectation-maximization). Both methods build upon a formalism we recently introduced called stochastic inversion transduction grammars. The first approach borrows a coarse monolingual grammar into our bilingual formalism, in order to transfer knowledge of one language’s constraints to the task of bracketing the texts in both languages. The second approach generalizes the inside-outside algorithm to adjust the grammar parameters so as to improve the likelihood of a training corpus. Preliminary experiments on parallel English-Chinese text support these strategies.
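
As a rough illustration of what bracketing under an inversion transduction grammar means, the sketch below models a bilingual parse as a binary tree whose internal nodes are either straight (both languages read the children left to right) or inverted (the second language reads them right to left). It is a toy under stated assumptions, not the chapter's algorithm: the names `Leaf`, `Node`, and `yields` are hypothetical, the tokens are placeholders rather than real English or Chinese words, and neither the rule probabilities nor EM training are modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """A matched word pair; an empty string marks a word with no counterpart."""
    lang1: str
    lang2: str


@dataclass
class Node:
    """A binary bracket; `inverted` reverses child order in language 2 only."""
    left: "Tree"
    right: "Tree"
    inverted: bool = False


Tree = Union[Leaf, Node]


def yields(tree: Tree) -> Tuple[List[str], List[str]]:
    """Read off the language-1 and language-2 word sequences of a bilingual tree."""
    if isinstance(tree, Leaf):
        return ([tree.lang1] if tree.lang1 else [],
                [tree.lang2] if tree.lang2 else [])
    left1, left2 = yields(tree.left)
    right1, right2 = yields(tree.right)
    lang1 = left1 + right1  # language 1 always keeps left-to-right order
    lang2 = right2 + left2 if tree.inverted else left2 + right2
    return lang1, lang2


# Placeholder tokens: a/b/c in language 1 pair with x/y/z in language 2.
example = Node(
    Leaf("a", "x"),
    Node(Leaf("b", "y"), Leaf("c", "z"), inverted=True),
    inverted=False,
)

lang1_words, lang2_words = yields(example)
print(lang1_words)  # ['a', 'b', 'c']
print(lang2_words)  # ['x', 'z', 'y']
```

A single tree therefore brackets both sentences at once, which is why constraints known for one language can restrict the bracketings considered for the other.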

Copyright information

© 1997 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Wu, D. (1997). Trainable Coarse Bilingual Grammars for Parallel Text Bracketing. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_15

  • DOI: https://doi.org/10.1007/978-94-017-2390-9_15

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5349-7

  • Online ISBN: 978-94-017-2390-9

  • eBook Packages: Springer Book Archive
