Abstract
Visual question answering (VQA) is a challenging task that necessitates sophisticated reasoning over visual elements to provide an accurate answer to a question. Most state-of-the-art VQA models are applicable only to English questions. However, applications such as visual assistance and tourism necessitate multilingual VQA systems. This paper presents an effective deep learning framework for Malayalam visual question answering (MVQA), which can answer a natural language question about an image in Malayalam. As no English–Malayalam VQA dataset was available, an MVQA dataset was created by translating English question–answer pairs from the Visual Genome dataset. The paper proposes an attention-driven MVQA model on the developed dataset. The proposed MVQA model uses a deep learning-based co-attention mechanism to jointly learn attention over images and Malayalam questions, and multimodal factorized high-order pooling is used for multimodal feature fusion. VQA models built from combinations of classical CNNs and RNNs were also evaluated on the developed MVQA dataset, and their performance was compared against the proposed attention-driven model. Experimental results show that the proposed attention-driven MVQA model outperforms the other MVQA models on the custom Malayalam VQA dataset.
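The factorized pooling step mentioned above can be illustrated with a minimal sketch of a single multimodal factorized bilinear (MFB) block, the building block that factorized high-order pooling cascades. The function name `mfb_fuse`, the projection matrices `U` and `V`, and all dimensions below are illustrative assumptions, not the authors' implementation; a high-order variant would cascade several such blocks and concatenate their outputs.

```python
import numpy as np

def mfb_fuse(x, y, U, V, k):
    """Sketch of one multimodal factorized bilinear (MFB) pooling block.

    x: image feature vector, shape (dx,)
    y: question feature vector, shape (dy,)
    U: learned image projection, shape (dx, k * o)
    V: learned question projection, shape (dy, k * o)
    k: factor (sum-pooling window) size; o is the fused output size.
    """
    z = (x @ U) * (y @ V)                 # element-wise joint embedding, shape (k*o,)
    z = z.reshape(-1, k).sum(axis=1)      # sum-pool over the factor dimension -> (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))   # signed square-root (power) normalization
    return z / (np.linalg.norm(z) + 1e-12)  # L2 normalization

# Toy usage with random (untrained) projections
rng = np.random.default_rng(0)
dx, dy, k, o = 8, 6, 5, 4
x, y = rng.normal(size=dx), rng.normal(size=dy)
U = rng.normal(size=(dx, k * o))
V = rng.normal(size=(dy, k * o))
fused = mfb_fuse(x, y, U, V, k)
print(fused.shape)  # (4,)
```

In a full model, `x` and `y` would be attended image and question features, and `U`, `V` would be trained end-to-end; the power and L2 normalizations stabilize the magnitudes of the bilinear interactions.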
Data availability
The authors declare that all data used in the design and preparation of the manuscript are reported in the manuscript.
Funding
The authors received no specific funding for this study.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest to report regarding the present study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kovath, A.G., Nayyar, A. & Sikha, O.K. Multimodal attention-driven visual question answering for Malayalam. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09818-4