Neural Image Caption Generation with Visual Attention: Enabling Image Accessibility for the Visually Impaired

Authors

  • Priyanka Agarwal  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India
  • Niveditha S  Department of Biotechnology, Rajalakshmi Engineering College, Thandalam, Chennai, Tamil Nadu, India
  • Shreyanth S  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India
  • Sarveshwaran R  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India
  • Rajesh P K  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India

DOI:

https://doi.org/10.32628/IJSRSET23103151

Keywords:

Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Attention Model, Natural Language Processing (NLP), NLTK, Machine Learning (ML), Deep Learning (DL), Flickr Dataset, gTTS (Google Text-to-Speech API)

Abstract

In today's digital age, the internet is saturated with images that often convey messages and emotions more effectively than words alone. Individuals with visual impairments, who cannot perceive and comprehend these images, face significant obstacles in this visual-centric online environment. With millions of visually impaired people around the globe, it is essential to close this accessibility gap and enable them to interact with online visual content. We propose a novel model for neural image caption generation with visual attention to address this pressing issue. Our model uses a combination of CNNs and RNNs to convert the content of images into aural descriptions, making them accessible to the visually impaired. The primary objective of our project is to generate captions that accurately and effectively describe the visual elements of an image. The proposed model operates in two phases. First, the captioning network converts the image's content into a textual description. The generated description is then converted to audio using a text-to-speech API, allowing visually impaired individuals to perceive visual information through sound. Through extensive experimentation and evaluation, we aim to achieve a high level of precision and descriptiveness in our image-captioning system. We evaluate the model's performance through comprehensive qualitative and quantitative assessments, comparing its generated captions to ground-truth captions annotated by humans. By enabling visually impaired individuals to access and comprehend online images, our research promotes digital inclusion and equality. It has the potential to improve the online experience for millions of visually impaired people, enabling them to interact with visual content and enriching their lives through meaningful image-based interactions.
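
The two-phase pipeline described in the abstract can be illustrated with a minimal, hedged sketch. The attention module below is a small PyTorch rendering of the additive (soft) visual attention popularized by "Show, Attend and Tell" (reference 3), not the authors' exact implementation; the class and parameter names (VisualAttention, feature_dim, hidden_dim, attn_dim) are illustrative assumptions. The second phase uses the standard gTTS call to turn a generated caption into an audio file.

```python
# Illustrative sketch only, assuming a PyTorch encoder-decoder captioner and gTTS for audio.
# Phase 1: an attention module the RNN decoder would query at each decoding step.
# Phase 2: text-to-speech conversion of the generated caption with gTTS.
import torch
import torch.nn as nn
from gtts import gTTS


class VisualAttention(nn.Module):
    """Additive (soft) attention over CNN feature-map regions; names are illustrative."""

    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)    # project encoder features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder state
        self.score = nn.Linear(attn_dim, 1)                  # scalar relevance per region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feature_dim) from the CNN encoder
        # hidden:   (batch, hidden_dim) from the RNN decoder at the current step
        scores = self.score(
            torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        ).squeeze(-1)                                          # (batch, num_regions)
        alpha = torch.softmax(scores, dim=1)                   # attention weights
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # weighted image context
        return context, alpha


def caption_to_audio(caption: str, out_path: str = "caption.mp3") -> None:
    """Phase 2: speak the generated caption aloud via Google Text-to-Speech."""
    gTTS(text=caption, lang="en").save(out_path)


if __name__ == "__main__":
    # Toy shapes: 1 image, 49 regions (e.g. a 7x7 feature map), 512-d features.
    attend = VisualAttention(feature_dim=512, hidden_dim=256, attn_dim=128)
    context, alpha = attend(torch.randn(1, 49, 512), torch.randn(1, 256))
    print(context.shape, alpha.shape)  # torch.Size([1, 512]) torch.Size([1, 49])
    caption_to_audio("A dog is running across a grassy field.")  # writes caption.mp3
```

In a full system, the context vector would be concatenated with the previous word embedding and fed to an LSTM/GRU decoder to predict the next caption word; once decoding finishes, the completed sentence is passed to the text-to-speech step above.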

References

  1. P. Anderson et al., "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6077-6086. https://doi.org/10.1109/CVPR.2018.00636
  2. H. Sharma, M. Agrahari, S. K. Singh, M. Firoj and R. K. Mishra, "Image Captioning: A Comprehensive Survey," 2020 International Conference on Power Electronics & IoT Applications in Renewable Energy and its Control (PARC), Mathura, India, 2020, pp. 325-328. https://doi.org/10.1109/PARC49193.2020.236619
  3. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., ... & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057). https://doi.org/10.48550/arXiv.1502.03044
  4. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). https://doi.org/10.48550/arXiv.2002.05709
  5. Gu, J., Wang, G., Cai, J., Chen, T., & Li, C. (2021). Image captioning with semantic attention. Neural Networks, 137, 161-172. https://doi.org/10.48550/arXiv.1603.03925
  6. Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, Li-Jia Li, "Deep Reinforcement Learning-based Image Captioning with Embedding Reward," ArXiv, 2017. https://doi.org/10.48550/arXiv.1704.03899
  7. Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng, "Semantic Compositional Networks for Visual Captioning," ArXiv, 2017. https://doi.org/10.48550/arXiv.1611.08002
  8. J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang and W. Xu, "CNN-RNN: A Unified Framework for Multi-label Image Classification," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2285-2294, https://doi.org/10.1109/CVPR.2016.251
  9. S. Li, Z. Tao, K. Li and Y. Fu, "Visual to Text: Survey of Image and Video Captioning," in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 3, no. 4, pp. 297-312, Aug. 2019, https://doi.org/10.1109/TETCI.2019.2892755
  10. M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga and M. Bennamoun, "Text to Image Synthesis for Improved Image Captioning," in IEEE Access, vol. 9, pp. 64918-64928, 2021, https://doi.org/10.1109/ACCESS.2021.3075579

Published

2023-06-30

Issue

Volume 10, Issue 3 (May-June 2023)

Section

Research Articles

How to Cite

[1]
Priyanka Agarwal, Niveditha S, Shreyanth S, Sarveshwaran R, Rajesh P K, "Neural Image Caption Generation with Visual Attention: Enabling Image Accessibility for the Visually Impaired," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 10, Issue 3, pp. 562-575, May-June 2023. Available at doi: https://doi.org/10.32628/IJSRSET23103151