
Multimodal attention-driven visual question answering for Malayalam

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Visual question answering (VQA) is a challenging task that requires sophisticated reasoning over visual elements to answer a natural language question accurately. The majority of state-of-the-art VQA models handle only English questions, yet applications such as visual assistance and tourism call for multilingual VQA systems. This paper presents an effective deep learning framework for Malayalam visual question answering (MVQA), which answers a natural language question about an image in Malayalam. As no English–Malayalam VQA dataset is available, an MVQA dataset was created by translating English question–answer pairs from the Visual Genome dataset. The paper proposes an attention-driven MVQA model trained on the developed dataset. The proposed model uses a deep learning-based co-attention mechanism to jointly learn attention over images and Malayalam questions, and a second-order multimodal factorized high-order pooling for multimodal feature fusion. Several VQA models built from combinations of classical CNNs and RNNs were evaluated on the developed MVQA dataset and compared against the proposed attention-driven model. Experimental results show that the proposed attention-driven MVQA model achieves state-of-the-art results among the compared models on the custom Malayalam VQA dataset.
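
To illustrate how the components named in the abstract fit together, the following is a minimal PyTorch sketch of a question-guided attention module and a second-order multimodal factorized high-order (MFH) pooling block feeding an answer classifier. It assumes pre-extracted CNN region features and a GRU encoder over Malayalam word embeddings; for brevity it attends only over image regions, whereas the paper's co-attention also attends over question words. All class names, layer sizes, and hyperparameters (factor_k, out_dim, dropout rate) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedAttention(nn.Module):
    """Scores each image region against the question vector and returns the
    attention-weighted sum of region features."""
    def __init__(self, img_dim, ques_dim, hidden=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(img_dim + ques_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, regions, q):            # regions: (B, R, img_dim), q: (B, ques_dim)
        q_tiled = q.unsqueeze(1).expand(-1, regions.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([regions, q_tiled], dim=-1)), dim=1)
        return (alpha * regions).sum(dim=1)   # attended image feature: (B, img_dim)


class MFH(nn.Module):
    """Second-order multimodal factorized high-order pooling in the spirit of
    Yu et al. (2018): a cascade of factorized bilinear (MFB) blocks where the
    expanded output of one block modulates the next and the pooled outputs are
    concatenated."""
    def __init__(self, img_dim, ques_dim, factor_k=5, out_dim=1000, order=2):
        super().__init__()
        self.k, self.o = factor_k, out_dim
        self.proj_v = nn.ModuleList([nn.Linear(img_dim, factor_k * out_dim) for _ in range(order)])
        self.proj_q = nn.ModuleList([nn.Linear(ques_dim, factor_k * out_dim) for _ in range(order)])
        self.drop = nn.Dropout(0.1)

    def forward(self, v, q):                  # v: (B, img_dim), q: (B, ques_dim)
        pooled, expanded = [], 1.0
        for pv, pq in zip(self.proj_v, self.proj_q):
            expanded = self.drop(expanded * pv(v) * pq(q))            # (B, k*o)
            z = expanded.view(-1, self.o, self.k).sum(dim=2)          # sum-pool over k
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)       # power normalisation
            pooled.append(F.normalize(z, dim=-1))                     # L2 normalisation
        return torch.cat(pooled, dim=-1)                              # (B, order*o)


class MalayalamVQA(nn.Module):
    """Minimal pipeline: embed and GRU-encode the Malayalam question, attend
    over CNN region features, fuse with MFH, and classify over answers."""
    def __init__(self, vocab_size, num_answers, img_dim=2048, emb_dim=300, ques_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, ques_dim, batch_first=True)
        self.attend = QuestionGuidedAttention(img_dim, ques_dim)
        self.fuse = MFH(img_dim, ques_dim)
        self.classify = nn.Linear(2 * 1000, num_answers)              # order * out_dim of MFH

    def forward(self, regions, question_ids):
        _, h = self.gru(self.embed(question_ids))                     # h: (1, B, ques_dim)
        q = h.squeeze(0)
        v = self.attend(regions, q)
        return self.classify(self.fuse(v, q))                         # answer logits (B, num_answers)
```

In such a setup, a forward pass would take a batch of region features (for example from a pretrained ResNet or Faster R-CNN) and padded Malayalam question token ids, and return logits over a fixed answer vocabulary.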

Data availability

The authors declare that all data used in the design and preparation of the manuscript are reported within the manuscript.

Funding

The authors received no specific funding for this study.

Author information

Corresponding author

Correspondence to Anand Nayyar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest regarding the present study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Kovath, A.G., Nayyar, A. & Sikha, O.K. Multimodal attention-driven visual question answering for Malayalam. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09818-4
