Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora

Conference paper
Natural Language Processing and Information Systems (NLDB 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8455)

Abstract

Creating and maintaining terminologies by human experts is known to be a resource-expensive task. We report on efforts to computationally support this process by treating term acquisition as a machine translation-guided classification problem that capitalizes on parallel multilingual corpora. Experiments are described for the French, German, Spanish and Dutch parts of a multilingual biomedical terminology, for which we generated 18k, 23k, 19k and 12k new terms and synonyms, respectively; about one half relate to concepts that had not been lexically labeled before. Based on expert assessment of a sample of the novel German segment, about 80% of these newly acquired terms were judged to be linguistically correct and biomedically reasonable additions to the terminology.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Hellrich, J., Hahn, U. (2014). Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_2

  • DOI: https://doi.org/10.1007/978-3-319-07983-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07982-0

  • Online ISBN: 978-3-319-07983-7

  • eBook Packages: Computer Science (R0)
