
Independent Fusion of Words and Image for Multimodal Machine Translation

  • Conference paper
Machine Translation (CCMT 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1104)


Abstract

Multimodal machine translation, which incorporates visual information from images, has become a research hotspot in recent years. Most existing works project image features into the text semantic space and merge them into the model in different ways. In practice, however, different source words may capture different visual information. We therefore propose a multimodal neural machine translation (MNMT) model that integrates words and the visual information of the image independently. For each source word, the word itself and the most similar visual information of the image are fused independently into the word's textual semantics, enhancing both the textual semantics and the corresponding visual information of different words. These fused representations are then used to compute the context vector in the attention of our model's decoder. We carry out experiments on the original English-German sentence pairs of the multimodal machine translation dataset Multi30k, and on Indonesian-Chinese sentence pairs that were manually annotated. Compared with existing RNN-based MNMT models, our model achieves better performance, which demonstrates its effectiveness.
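
As a rough illustration of the fusion idea described above, here is a minimal sketch; it is not the authors' code, and every name in it (independent_fusion, word_states, img_feats, W_fuse) as well as the exact fusion form is an assumption. Each source word attends over spatial image features by similarity, and the resulting word-specific visual vector is fused with that word's representation independently of the other words; the decoder's attention then treats the fused vectors as its source annotations.

import torch
import torch.nn.functional as F

def independent_fusion(word_states, img_feats, W_fuse):
    # word_states: (seq_len, d) encoder hidden states, one per source word
    # img_feats:   (n_regions, d) spatial image features (e.g., a CNN feature map)
    # W_fuse:      (2*d, d) learned fusion projection (hypothetical parameterization)
    sim = word_states @ img_feats.t()        # (seq_len, n_regions) word-region similarity
    attn = F.softmax(sim, dim=-1)            # per-word attention over image regions
    visual = attn @ img_feats                # (seq_len, d) word-specific visual context
    # fuse textual and visual information independently for each word
    fused = torch.tanh(torch.cat([word_states, visual], dim=-1) @ W_fuse)
    return fused                             # used as source annotations by decoder attention

# usage with random tensors standing in for real encoder/CNN outputs
seq_len, n_regions, d = 5, 49, 512
fused = independent_fusion(torch.randn(seq_len, d),
                           torch.randn(n_regions, d),
                           torch.randn(2 * d, d))
print(fused.shape)  # torch.Size([5, 512])

A standard attention decoder can then compute its context vector over these fused annotations exactly as it would over text-only encoder states.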


Notes

  1. http://www.statmt.org/wmt16/multimodal-task.html
  2. http://shannon.cs.illinois.edu/DenotationGraph/
  3. https://github.com/moses-smt/mosesdecoder
  4. https://translate.google.com/


Acknowledgements

This work is supported by the National Natural Science Foundation of China (61976062) and the Science and Technology Program of Guangzhou, China (201904010303).

Author information

Corresponding author

Correspondence to Xia Li.


Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ma, J., Qin, S., Chen, M., Li, X. (2019). Independent Fusion of Words and Image for Multimodal Machine Translation. In: Huang, S., Knight, K. (eds) Machine Translation. CCMT 2019. Communications in Computer and Information Science, vol 1104. Springer, Singapore. https://doi.org/10.1007/978-981-15-1721-1_4

  • DOI: https://doi.org/10.1007/978-981-15-1721-1_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-1720-4

  • Online ISBN: 978-981-15-1721-1

  • eBook Packages: Computer Science, Computer Science (R0)
