Abstract
Visual question answering (VQA) is a challenging task that necessitates sophisticated reasoning over visual elements to provide an accurate answer to a question. Most state-of-the-art VQA models are applicable only to English questions. However, applications such as visual assistance and tourism necessitate multilingual VQA systems. This paper presents an effective deep learning framework for Malayalam visual question answering (MVQA), which can answer a natural language question about an image in Malayalam. As no English–Malayalam VQA dataset was available, an MVQA dataset was created by translating English question–answer pairs from the Visual Genome dataset. The paper proposes an attention-driven MVQA model on the developed dataset. The proposed MVQA model uses a deep learning-based co-attention mechanism to jointly learn attention over images and Malayalam questions, and multimodal factorized high-order pooling is used for multimodal feature fusion. VQA models built from combinations of classical CNNs and RNNs were also evaluated on the developed MVQA dataset, and their performance was compared against the proposed attention-driven model. Experimental results show that the proposed attention-driven MVQA model outperforms the other MVQA models on the custom Malayalam VQA dataset.
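The factorized pooling step mentioned above can be illustrated with a minimal sketch of a single multimodal factorized bilinear (MFB) block, the building block that factorized high-order pooling cascades. The function name `mfb_fuse`, the projection matrices `U` and `V`, and all dimensions below are illustrative assumptions, not the authors' implementation; a high-order variant would cascade several such blocks and concatenate their outputs.

```python
import numpy as np

def mfb_fuse(x, y, U, V, k):
    """Sketch of one multimodal factorized bilinear (MFB) pooling block.

    x: image feature vector, shape (dx,)
    y: question feature vector, shape (dy,)
    U: learned image projection, shape (dx, k * o)
    V: learned question projection, shape (dy, k * o)
    k: factor (sum-pooling window) size; o is the fused output size.
    """
    z = (x @ U) * (y @ V)                 # element-wise joint embedding, shape (k*o,)
    z = z.reshape(-1, k).sum(axis=1)      # sum-pool over the factor dimension -> (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))   # signed square-root (power) normalization
    return z / (np.linalg.norm(z) + 1e-12)  # L2 normalization

# Toy usage with random (untrained) projections
rng = np.random.default_rng(0)
dx, dy, k, o = 8, 6, 5, 4
x, y = rng.normal(size=dx), rng.normal(size=dy)
U = rng.normal(size=(dx, k * o))
V = rng.normal(size=(dy, k * o))
fused = mfb_fuse(x, y, U, V, k)
print(fused.shape)  # (4,)
```

In a full model, `x` and `y` would be attended image and question features, and `U`, `V` would be trained end-to-end; the power and L2 normalizations stabilize the magnitudes of the bilinear interactions.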
Data availability
The authors declare that all data used in the design and preparation of the manuscript are reported in the manuscript.
Funding
The authors received no specific funding for this study.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest to report regarding the present study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kovath, A.G., Nayyar, A. & Sikha, O.K. Multimodal attention-driven visual question answering for Malayalam. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09818-4