Choosing Linguistics over Vision to Describe Images

Authors

  • Ankush Gupta International Institute of Information Technology, Hyderabad
  • Yashaswi Verma International Institute of Information Technology, Hyderabad
  • C. V. Jawahar International Institute of Information Technology, Hyderabad

DOI:

https://doi.org/10.1609/aaai.v26i1.8205

Abstract

In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts at this task mostly rely on visual cues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions. Here, we present a generic method that benefits from all three of these sources (i.e., visual cues, corpus statistics, and the available descriptions) simultaneously and is capable of constructing novel descriptions. Our approach works on syntactically and linguistically motivated phrases extracted from the human descriptions. Experimental evaluations demonstrate that our formulation mostly generates lucid and semantically correct descriptions, and that it significantly outperforms previous methods on automatic evaluation metrics. A significant advantage of our approach is that it can generate multiple interesting descriptions for an image. Unlike any previous work, we also test the applicability of our method on a large dataset containing complex images with rich descriptions.
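
The abstract only outlines the key idea: mine phrases from existing human descriptions, score them for a new image by combining visual cues with corpus statistics, and compose the top-ranked phrases into novel sentences. As a rough illustration of that pipeline only, here is a minimal, self-contained Python sketch; the toy phrase patterns, the scoring weights, and the visual_similarity placeholder are assumptions for illustration and are not the authors' actual formulation.

    import itertools
    from collections import Counter

    # Toy training data: hypothetical image IDs paired with human descriptions.
    training_descriptions = {
        "img1": ["a brown dog runs on the grass"],
        "img2": ["a black dog sits on the sofa"],
        "img3": ["a small cat sleeps on the bed"],
    }

    def extract_phrases(sentence):
        """Crude stand-in for linguistically motivated phrase extraction:
        splits a simple '<det> <adj> <noun> <verb> <prep> <det> <noun>'
        caption into (noun phrase, verb, prepositional phrase)."""
        t = sentence.split()  # assumes the toy caption layout above
        return (f"{t[1]} {t[2]}",        # attribute + subject, e.g. "brown dog"
                t[3],                    # verb, e.g. "runs"
                f"{t[4]} the {t[6]}")    # preposition + object, e.g. "on the grass"

    # Mine a phrase pool (with counts) from every training description.
    phrase_pool = {"np": Counter(), "vp": Counter(), "pp": Counter()}
    for captions in training_descriptions.values():
        for caption in captions:
            np_, vp_, pp_ = extract_phrases(caption)
            phrase_pool["np"][np_] += 1
            phrase_pool["vp"][vp_] += 1
            phrase_pool["pp"][pp_] += 1

    def visual_similarity(phrase, image):
        """Hypothetical placeholder: a real system would compare image
        features; here every phrase is equally plausible for the query."""
        return 1.0

    def score(phrase, count, image, total):
        # Combine corpus statistics (relative frequency) with the visual cue.
        return 0.5 * (count / total) + 0.5 * visual_similarity(phrase, image)

    def describe(image, k=2):
        """Rank each phrase type, then compose top phrases into sentences."""
        best = {}
        for kind, counter in phrase_pool.items():
            total = sum(counter.values())
            ranked = sorted(counter, reverse=True,
                            key=lambda p: score(p, counter[p], image, total))
            best[kind] = ranked[:k]
        # Template "a <np> <vp> <pp>." yields multiple candidate descriptions.
        return [f"a {n} {v} {p}."
                for n, v, p in itertools.product(best["np"], best["vp"], best["pp"])]

    for sentence in describe("query_image")[:3]:
        print(sentence)

The sketch only shows the shape of the pipeline and how the three information sources could be combined in a score; the paper itself defines the actual phrase extraction, ranking model, and sentence construction.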

Published

2021-09-20

How to Cite

Gupta, A., Verma, Y., & Jawahar, C. V. (2021). Choosing Linguistics over Vision to Describe Images. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 606–612. https://doi.org/10.1609/aaai.v26i1.8205