cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Gupta, Kshitij; Gautam, Devansh; Mamidi, Radhika

arXiv:2206.03354 (cs)

[Submitted on 7 Jun 2022 (v1), last revised 9 Jun 2022 (this version, v2)]

Abstract: Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language. We propose to extend OSCAR+, a model which leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel approach to knowledge distillation to train the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer the knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Though we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages as well. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models by a relative increase in accuracy of 4.4% and 13.4%, respectively.
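
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of generic soft-label knowledge distillation, not the authors' loss. It assumes an English teacher and a target-language student each produce logits over the same answer set for one image and a pair of parallel questions. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened answer distributions.

    Generic soft-label distillation (an assumption, not the paper's exact
    objective). The T^2 factor keeps the loss scale comparable across
    temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three candidate answers for one image.
# The teacher reads the English question; the student reads its translation.
teacher = [2.0, 0.5, -1.0]
student_close = [1.9, 0.6, -1.1]   # student roughly agrees with the teacher
student_far = [-1.0, 0.5, 2.0]     # student prefers a different answer

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
print(loss_close < loss_far)
```

Minimizing such a loss over parallel sentence pairs pushes the student's answer distribution toward the teacher's, which is one way an English-only model could supervise a target-language model without target-language pretraining data.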

Comments:	Accepted at ICPR 2022; 9 pages
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.03354 [cs.CL]
	(or arXiv:2206.03354v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2206.03354

Submission history

From: Kshitij Gupta
[v1] Tue, 7 Jun 2022 14:46:30 UTC (11,286 KB)
[v2] Thu, 9 Jun 2022 05:40:02 UTC (11,287 KB)
