Choosing Linguistics over Vision to Describe Images

Authors

  • Ankush Gupta International Institute of Information Technology, Hyderabad
  • Yashaswi Verma International Institute of Information Technology, Hyderabad
  • C. V. Jawahar International Institute of Information Technology, Hyderabad

DOI:

https://doi.org/10.1609/aaai.v26i1.8205

Abstract

In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts at this task mostly rely on visual cues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions. Here, we present a generic method that benefits from all three of these sources (i.e., visual cues, corpus statistics, and the available descriptions) simultaneously and is capable of constructing novel descriptions. Our approach works on syntactically and linguistically motivated phrases extracted from the human descriptions. Experimental evaluations demonstrate that our formulation mostly generates lucid and semantically correct descriptions, and that it significantly outperforms previous methods on automatic evaluation metrics. A significant advantage of our approach is that it can generate multiple interesting descriptions for an image. Unlike any previous work, we also test the applicability of our method on a large dataset containing complex images with rich descriptions.
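
The abstract only outlines the key idea: mine phrases from existing human descriptions, score them for a new image by combining visual cues with corpus statistics, and compose the top-ranked phrases into novel sentences. As a rough illustration of that pipeline only, here is a minimal, self-contained Python sketch; the toy phrase patterns, the scoring weights, and the visual_similarity placeholder are assumptions for illustration and are not the authors' actual formulation.

    import itertools
    from collections import Counter

    # Toy training data: hypothetical image IDs paired with human descriptions.
    training_descriptions = {
        "img1": ["a brown dog runs on the grass"],
        "img2": ["a black dog sits on the sofa"],
        "img3": ["a small cat sleeps on the bed"],
    }

    def extract_phrases(sentence):
        """Crude stand-in for linguistically motivated phrase extraction:
        splits a simple '<det> <adj> <noun> <verb> <prep> <det> <noun>'
        caption into (noun phrase, verb, prepositional phrase)."""
        t = sentence.split()  # assumes the toy caption layout above
        return (f"{t[1]} {t[2]}",        # attribute + subject, e.g. "brown dog"
                t[3],                    # verb, e.g. "runs"
                f"{t[4]} the {t[6]}")    # preposition + object, e.g. "on the grass"

    # Mine a phrase pool (with counts) from every training description.
    phrase_pool = {"np": Counter(), "vp": Counter(), "pp": Counter()}
    for captions in training_descriptions.values():
        for caption in captions:
            np_, vp_, pp_ = extract_phrases(caption)
            phrase_pool["np"][np_] += 1
            phrase_pool["vp"][vp_] += 1
            phrase_pool["pp"][pp_] += 1

    def visual_similarity(phrase, image):
        """Hypothetical placeholder: a real system would compare image
        features; here every phrase is equally plausible for the query."""
        return 1.0

    def score(phrase, count, image, total):
        # Combine corpus statistics (relative frequency) with the visual cue.
        return 0.5 * (count / total) + 0.5 * visual_similarity(phrase, image)

    def describe(image, k=2):
        """Rank each phrase type, then compose top phrases into sentences."""
        best = {}
        for kind, counter in phrase_pool.items():
            total = sum(counter.values())
            ranked = sorted(counter, reverse=True,
                            key=lambda p: score(p, counter[p], image, total))
            best[kind] = ranked[:k]
        # Template "a <np> <vp> <pp>." yields multiple candidate descriptions.
        return [f"a {n} {v} {p}."
                for n, v, p in itertools.product(best["np"], best["vp"], best["pp"])]

    for sentence in describe("query_image")[:3]:
        print(sentence)

The sketch only shows the shape of the pipeline and how the three information sources could be combined in a score; the paper itself defines the actual phrase extraction, ranking model, and sentence construction.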

Published

2021-09-20

How to Cite

Gupta, A., Verma, Y., & Jawahar, C. V. (2021). Choosing Linguistics over Vision to Describe Images. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 606–612. https://doi.org/10.1609/aaai.v26i1.8205