ABSTRACT
In an era of unstructured data abundance, one might think that our data requirements for building robust language processing systems have been met. This is not the case, however, on a global scale: of the world's more than 7,000 languages, only a handful have digital resources. High-performing systems at scale typically require annotated resources that span genre and domain divides. Moreover, the concentration of resources in a handful of languages reflects the digital disparity across societies, leading to inadvertent biases in systems. In this talk I will present solutions for low-resource scenarios, both across domains and genres and across languages.
I will address data paucity from the angle of devising principled metrics for data selection. Summarizing data samples with quantitative measures has a long history, descriptive statistics being a case in point. As natural language processing methods flourish, however, we still lack characteristic metrics that describe a collection of texts in terms of the words, sentences, or paragraphs it comprises. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds the desired properties and resonates with human intuition. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with the text classification performance of a renowned model, BERT, which could inspire future applications. We look specifically at the problems of intent classification (IC) and sentiment analysis.
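To make the dispersion/sparsity/uniformity intuition concrete, here is a minimal sketch of such metrics computed over text embeddings. The exact formulas from the talk are not given in this abstract, so the definitions below are illustrative stand-ins: diversity as mean pairwise distance, density as samples per unit radius, and homogeneity as the inverse coefficient of variation of pairwise distances.

```python
import numpy as np

def pairwise_distances(embeddings: np.ndarray) -> np.ndarray:
    """Upper-triangle pairwise Euclidean distances between embeddings."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    return d[np.triu_indices(len(embeddings), k=1)]

def diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise distance: how dispersed the collection is."""
    return float(pairwise_distances(embeddings).mean())

def density(embeddings: np.ndarray) -> float:
    """Samples per unit radius of the bounding sphere (sparsity proxy)."""
    center = embeddings.mean(axis=0)
    radius = np.linalg.norm(embeddings - center, axis=1).max()
    return float(len(embeddings) / (radius + 1e-12))

def homogeneity(embeddings: np.ndarray) -> float:
    """1 / (1 + coefficient of variation of pairwise distances)."""
    d = pairwise_distances(embeddings)
    return float(1.0 / (1.0 + d.std() / (d.mean() + 1e-12)))

# Sanity check on synthetic data: a tight cluster vs. a widely spread one.
rng = np.random.default_rng(0)
tight = rng.normal(0, 0.1, size=(50, 16))
spread = rng.normal(0, 2.0, size=(50, 16))
assert diversity(spread) > diversity(tight)
assert density(tight) > density(spread)
```

In practice the embeddings would come from a trained encoder (e.g. BERT sentence representations), and the metrics would be compared across candidate training sets before model selection.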
On the modeling side, for low-resource genres and domains, we investigate few-shot learning techniques for intent classification (IC) and for the sequence-labeling models used in slot filling (SF), both core components of dialogue systems for task-oriented chatbots. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra-low-resource scenarios, and we establish a few-shot IC/SF benchmark. We show that two popular few-shot learning algorithms, model-agnostic meta-learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark.
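The core of a prototypical network is simple enough to sketch: average the support embeddings of each class into a prototype, then assign queries to the nearest prototype. The sketch below assumes utterances are already encoded as vectors; in the actual systems the encoder would be a trained neural model, and the intent labels here are hypothetical.

```python
import numpy as np

def prototypes(support_x: np.ndarray, support_y: np.ndarray) -> dict:
    """Mean embedding per class over the support set."""
    return {c: support_x[support_y == c].mean(axis=0)
            for c in np.unique(support_y)}

def classify(query_x: np.ndarray, protos: dict) -> np.ndarray:
    """Assign each query to the class of its nearest prototype."""
    classes = sorted(protos)
    dists = np.stack([np.linalg.norm(query_x - protos[c], axis=1)
                      for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

# Toy 2-way, 3-shot episode: two well-separated clusters of "utterance"
# embeddings standing in for two unseen intent classes.
rng = np.random.default_rng(1)
support_x = np.vstack([rng.normal(0, 0.1, (3, 8)),
                       rng.normal(3, 0.1, (3, 8))])
support_y = np.array([0, 0, 0, 1, 1, 1])
query_x = np.vstack([rng.normal(0, 0.1, (2, 8)),
                     rng.normal(3, 0.1, (2, 8))])
preds = classify(query_x, prototypes(support_x, support_y))
assert list(preds) == [0, 0, 1, 1]
```

Because the prototypes are recomputed per episode, the same encoder can classify into classes it never saw during training, which is exactly the ultra-low-resource setting the benchmark targets.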
From a multilingual perspective, we bootstrap cross-lingual systems to induce word- and sentence-level representations. Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach that learns a bilingual dictionary for a pair of languages given their independently learned monolingual word embeddings. The proposed method exploits local and global structure in the monolingual vector spaces to align them so that similar words are mapped to each other. Finally, I will show how we use annotation projection for cross-lingual emotion detection and semantic role labeling. We leverage a multitask learning framework coupled with an annotation projection method that transfers labels from a rich-resource language to a low-resource language through parallel data, and we train predictive models on the projected data.
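A common building block in embedding-space alignment, including unsupervised pipelines where translation pairs are themselves induced from structure, is the orthogonal (Procrustes) mapping step. The sketch below is not the talk's full unsupervised method; it only shows how, given some hypothesized word pairs, one recovers the rotation that best maps one monolingual space onto the other.

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||XW - Y||_F (closed-form via SVD).

    Rows of X and Y are embeddings of hypothesized translation pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: if the "target" space is an exact rotation of the "source"
# space, the Procrustes solution recovers that rotation.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))          # source-language embeddings
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # hidden true rotation
Y = X @ Q                               # target-language embeddings
W = procrustes(X, Y)
assert np.allclose(X @ W, Y, atol=1e-8)
```

In an unsupervised setting this step is typically iterated: induce candidate pairs from nearest neighbors across the mapped spaces, refit W, and repeat until the dictionary stabilizes.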
Index Terms
- Data Paucity and Low Resource Scenarios: Challenges and Opportunities