Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora

Hellrich, Johannes; Hahn, Udo

doi:10.1007/978-3-319-07983-7_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8455)

Included in the following conference series:

International Conference on Applications of Natural Language to Data Bases/Information Systems

Abstract

Creating and maintaining terminologies by human experts is known to be a resource-expensive task. Here we report on efforts to support this process computationally by treating term acquisition as a machine translation-guided classification problem that capitalizes on parallel multilingual corpora. Experiments are described for the French, German, Spanish and Dutch parts of a multilingual biomedical terminology, for which we generated 18k, 23k, 19k and 12k new terms and synonyms, respectively; about one half relate to concepts that have not been lexically labeled before. Based on expert assessment of a sample of the novel German segment, about 80% of these newly acquired terms were judged to be linguistically correct and biomedically reasonable additions to the terminology.

Author information

Authors and Affiliations

Jena University Language & Information Engineering (Julie) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
Johannes Hellrich & Udo Hahn

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Computer Science, 2 rue Conté, 75003, Paris, France
Elisabeth Métais
Cirad, TETIS, 500 rue J.F. Breton, 34093, Montpellier Cedex 5, France
Mathieu Roche
Irstea, TETIS, 500 rue J.F. Breton, 34093, Montpellier Cedex 5, France
Maguelonne Teisseire

About this paper

Cite this paper

Hellrich, J., Hahn, U. (2014). Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_2

DOI: https://doi.org/10.1007/978-3-319-07983-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07982-0
Online ISBN: 978-3-319-07983-7
eBook Packages: Computer Science (R0)
