Neural Image Caption Generation with Visual Attention: Enabling Image Accessibility for the Visually Impaired

Authors

  • Priyanka Agarwal  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India
  • Niveditha S  Department of Biotechnology, Rajalakshmi Engineering College, Thandalam, Chennai, Tamil Nadu, India
  • Shreyanth S  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India
  • Sarveshwaran R  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India
  • Rajesh P K  MTech in Data Science and Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India

DOI:

https://doi.org/10.32628/IJSRSET23103151

Keywords:

Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Attention Model, Natural Language Processing (NLP), NLTK, Machine Learning (ML), Deep Learning (DL), Flickr Dataset, gTTS (Google Text-to-Speech API)

Abstract

In today's digital age, the internet is saturated with images that often convey messages and emotions more effectively than words alone. Individuals with visual impairments, who cannot perceive and comprehend these images, face significant obstacles in this visual-centric online environment. With millions of visually impaired people around the globe, it is essential to close this accessibility gap and enable them to interact with online visual content. We propose a novel model for neural image caption generation with visual attention to address this pressing issue. Our model uses a combination of CNNs and RNNs to convert the content of images into aural descriptions, making them accessible to the visually impaired. The primary objective of our project is to generate captions that accurately and effectively describe the visual elements of an image. The proposed model operates in two phases. First, the captioning network converts the image's content into a textual description. The generated description is then converted to audio using a text-to-speech API, allowing visually impaired individuals to perceive visual information through sound. Through extensive experimentation and evaluation, we aim to achieve a high level of precision and descriptiveness in our image-captioning system. We evaluate the model's performance through comprehensive qualitative and quantitative assessments, comparing its generated captions to ground-truth captions annotated by humans. By enabling visually impaired individuals to access and comprehend online images, our research promotes digital inclusion and equality. It has the potential to improve the online experience for millions of visually impaired people, enabling them to interact with visual content and enriching their lives through meaningful image-based interactions.
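
The two-phase pipeline described in the abstract can be illustrated with a minimal, hedged sketch. The attention module below is a small PyTorch rendering of the additive (soft) visual attention popularized by "Show, Attend and Tell" (reference 3), not the authors' exact implementation; the class and parameter names (VisualAttention, feature_dim, hidden_dim, attn_dim) are illustrative assumptions. The second phase uses the standard gTTS call to turn a generated caption into an audio file.

```python
# Illustrative sketch only, assuming a PyTorch encoder-decoder captioner and gTTS for audio.
# Phase 1: an attention module the RNN decoder would query at each decoding step.
# Phase 2: text-to-speech conversion of the generated caption with gTTS.
import torch
import torch.nn as nn
from gtts import gTTS


class VisualAttention(nn.Module):
    """Additive (soft) attention over CNN feature-map regions; names are illustrative."""

    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)    # project encoder features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder state
        self.score = nn.Linear(attn_dim, 1)                  # scalar relevance per region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feature_dim) from the CNN encoder
        # hidden:   (batch, hidden_dim) from the RNN decoder at the current step
        scores = self.score(
            torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        ).squeeze(-1)                                          # (batch, num_regions)
        alpha = torch.softmax(scores, dim=1)                   # attention weights
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # weighted image context
        return context, alpha


def caption_to_audio(caption: str, out_path: str = "caption.mp3") -> None:
    """Phase 2: speak the generated caption aloud via Google Text-to-Speech."""
    gTTS(text=caption, lang="en").save(out_path)


if __name__ == "__main__":
    # Toy shapes: 1 image, 49 regions (e.g. a 7x7 feature map), 512-d features.
    attend = VisualAttention(feature_dim=512, hidden_dim=256, attn_dim=128)
    context, alpha = attend(torch.randn(1, 49, 512), torch.randn(1, 256))
    print(context.shape, alpha.shape)  # torch.Size([1, 512]) torch.Size([1, 49])
    caption_to_audio("A dog is running across a grassy field.")  # writes caption.mp3
```

In a full system, the context vector would be concatenated with the previous word embedding and fed to an LSTM/GRU decoder to predict the next caption word; once decoding finishes, the completed sentence is passed to the text-to-speech step above.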

References

  1. P. Anderson et al., "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6077-6086. https://doi.org/10.1109/CVPR.2018.00636
  2. H. Sharma, M. Agrahari, S. K. Singh, M. Firoj and R. K. Mishra, "Image Captioning: A Comprehensive Survey," 2020 International Conference on Power Electronics & IoT Applications in Renewable Energy and its Control (PARC), Mathura, India, 2020, pp. 325-328. https://doi.org/10.1109/PARC49193.2020.236619
  3. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., ... & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057). https://doi.org/10.48550/arXiv.1502.03044
  4. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). https://doi.org/10.48550/arXiv.2002.05709
  5. Gu, J., Wang, G., Cai, J., Chen, T., & Li, C. (2021). Image captioning with semantic attention. Neural Networks, 137, 161-172. https://doi.org/10.48550/arXiv.1603.03925
  6. Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, Li-Jia Li, "Deep Reinforcement Learning-based Image Captioning with Embedding Reward," ArXiv, 2017. https://doi.org/10.48550/arXiv.1704.03899
  7. Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng, "Semantic Compositional Networks for Visual Captioning," ArXiv, 2017. https://doi.org/10.48550/arXiv.1611.08002
  8. J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang and W. Xu, "CNN-RNN: A Unified Framework for Multi-label Image Classification," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 2285-2294, https://doi.org/10.1109/CVPR.2016.251
  9. S. Li, Z. Tao, K. Li and Y. Fu, "Visual to Text: Survey of Image and Video Captioning," in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 3, no. 4, pp. 297-312, Aug. 2019, https://doi.org/10.1109/TETCI.2019.2892755
  10. M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga and M. Bennamoun, "Text to Image Synthesis for Improved Image Captioning," in IEEE Access, vol. 9, pp. 64918-64928, 2021, https://doi.org/10.1109/ACCESS.2021.3075579

Published

2023-06-30

Issue

Volume 10, Issue 3 (May-June 2023)

Section

Research Articles

How to Cite

[1]
Priyanka Agarwal, Niveditha S, Shreyanth S, Sarveshwaran R, Rajesh P K, "Neural Image Caption Generation with Visual Attention: Enabling Image Accessibility for the Visually Impaired," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 10, Issue 3, pp. 562-575, May-June 2023. Available at doi: https://doi.org/10.32628/IJSRSET23103151