Visually grounded few-shot word learning in low-resource settings

Nortje, Leanne; Oneata, Dan; Kamper, Herman

arXiv:2306.11371 (eess)

[Submitted on 20 Jun 2023 (v1), last revised 18 Apr 2024 (this version, v3)]

Abstract: We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yorùbá. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yorùbá show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
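The retrieval task described in the abstract (pick the test image that depicts a spoken query word, scored by a word-to-image attention mechanism) can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify the model's details, so the embedding format (a word as one vector, an image as a list of region vectors), the softmax-weighted scoring rule, and the names `attention_similarity` and `pick_image` are all assumptions made only for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_similarity(word: Vector, regions: List[Vector]) -> float:
    """Score one image against one word embedding.

    Assumption (not from the paper): the image is a list of region
    embeddings, and the score is the softmax-weighted average of the
    word-region dot products, so well-matching regions dominate.
    """
    scores = [dot(word, r) for r in regions]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def pick_image(word: Vector, images: List[List[Vector]]) -> int:
    """Return the index of the test image that best matches the query word."""
    sims = [attention_similarity(word, regions) for regions in images]
    return max(range(len(images)), key=lambda i: sims[i])


# Toy data (hypothetical): 2-D embeddings, three test images, two regions each.
query = [1.0, 0.0]
test_images = [
    [[0.0, 1.0], [0.1, 0.9]],  # no region resembles the query
    [[0.9, 0.1], [0.0, 1.0]],  # one region closely matches the query
    [[0.5, 0.5], [0.4, 0.6]],  # partial match in every region
]
best = pick_image(query, test_images)
print(best)
```

In this toy setup the second image (index 1) is selected, because one of its regions aligns closely with the query vector and the attention weights emphasise that region. In a real system the embeddings would come from trained speech and vision encoders rather than hand-written vectors.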

Comments:	Accepted to TASLP. arXiv admin note: substantial text overlap with arXiv:2305.15937
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Cite as:	arXiv:2306.11371 [eess.AS]
	(or arXiv:2306.11371v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2306.11371

Submission history

From: Leanne Nortje
[v1] Tue, 20 Jun 2023 08:27:42 UTC (39,511 KB)
[v2] Wed, 21 Jun 2023 07:22:08 UTC (39,511 KB)
[v3] Thu, 18 Apr 2024 17:36:53 UTC (23,993 KB)
