DOI: 10.1145/3394486.3409565

Data Paucity and Low Resource Scenarios: Challenges and Opportunities

Published: 20 August 2020

ABSTRACT

In an era of unstructured data abundance, one might think we have met the data requirements for building robust language processing systems. However, this is not the case on a global scale: of the world's more than 7,000 languages, only a handful have digital resources. Systems that perform well at scale typically require annotated resources that span genre and domain divides. Moreover, the scarcity of resources in many languages reflects the digital disparity across societies, leading to inadvertent biases in systems. In this talk I will present solutions for low-resource scenarios, across domains and genres as well as across languages.

I will discuss handling data paucity from the angle of devising principled metrics for data selection. Summarizing data samples with quantitative measures has a long history, with descriptive statistics being a case in point. However, even as natural language processing methods flourish, there are still too few characteristic metrics for describing a collection of texts in terms of the words, sentences, or paragraphs it comprises. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric exhibits the desired properties and resonates with human intuition. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with the text classification performance of a well-known model, BERT, which could inspire future applications. We specifically study intent classification (IC) as well as sentiment analysis.
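
As a rough illustration of how such characteristic metrics might be computed, here is a minimal Python sketch. The abstract does not give the formal definitions, so the embedding-based formulas below are assumptions chosen to mirror the stated notions of dispersion, sparsity, and uniformity; each text is assumed to be represented by a single embedding vector.

    # Illustrative only: these formulas are assumptions, not the talk's exact definitions.
    import numpy as np
    from sklearn.cluster import KMeans

    def characteristic_metrics(embeddings, n_clusters=10, seed=0):
        X = np.asarray(embeddings, dtype=np.float64)
        km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X)
        sizes = np.bincount(km.labels_, minlength=n_clusters)

        # Diversity: average distance of examples from the collection centroid (dispersion).
        diversity = float(np.mean(np.linalg.norm(X - X.mean(axis=0), axis=1)))

        # Density: examples per cluster relative to that cluster's average radius (sparsity).
        nonempty = np.flatnonzero(sizes)
        radii = np.array([
            np.mean(np.linalg.norm(X[km.labels_ == k] - km.cluster_centers_[k], axis=1))
            for k in nonempty
        ])
        density = float(np.mean(sizes[nonempty] / (radii + 1e-9)))

        # Homogeneity: normalized entropy of the cluster-size distribution
        # (1.0 means examples are spread uniformly across clusters).
        p = sizes[nonempty] / len(X)
        homogeneity = float(-(p * np.log(p)).sum() / np.log(n_clusters))

        return {"diversity": diversity, "density": density, "homogeneity": homogeneity}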

On the modeling side, for low-resource genres and domains, we investigate few-shot learning techniques for intent classification (IC) and for the sequence labeling models used in slot filling (SF), both core components of dialogue systems for task-oriented chatbots. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra-low-resource scenarios, and we establish a few-shot IC/SF benchmark. We show that two popular few-shot learning algorithms, model-agnostic meta-learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark.
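
Below is a minimal sketch of a single prototypical-network episode for few-shot intent classification, one of the two algorithms named above. It is generic illustrative code rather than the benchmark implementation; the encoder and episode sampling are placeholders.

    # Generic prototypical-network episode; `encoder` is any module that maps a batch
    # of utterances to fixed-size embeddings, and labels are integers 0..n_classes-1.
    import torch
    import torch.nn.functional as F

    def prototypical_episode(encoder, support_x, support_y, query_x, query_y, n_classes):
        support_emb = encoder(support_x)                 # (n_support, d)
        query_emb = encoder(query_x)                     # (n_query, d)

        # One prototype per intent class: the mean of its support embeddings.
        prototypes = torch.stack([
            support_emb[support_y == c].mean(dim=0) for c in range(n_classes)
        ])                                               # (n_classes, d)

        # Classify each query by squared Euclidean distance to the prototypes.
        dists = torch.cdist(query_emb, prototypes) ** 2  # (n_query, n_classes)
        log_p = F.log_softmax(-dists, dim=1)
        loss = F.nll_loss(log_p, query_y)
        accuracy = (log_p.argmax(dim=1) == query_y).float().mean()
        return loss, accuracy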

From a multilingual perspective, we bootstrap cross-lingual systems for inducing word- and sentence-level representations. Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach for learning a bilingual dictionary for a pair of languages given their independently learned monolingual word embeddings. The proposed method exploits local and global structure in the monolingual vector spaces to align them such that similar words are mapped to each other. Finally, I will show how we use projection for cross-lingual emotion detection and semantic role labeling: we leverage a multitask learning framework coupled with an annotation projection method that transfers annotations from a resource-rich language to a low-resource language through parallel data, and train predictive models on the projected data.
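
As a minimal sketch of the dictionary-induction step, the code below assumes a source-to-target mapping W has already been learned (the unsupervised local/global alignment itself is not reproduced here) and pairs each source word with its nearest target words by cosine similarity.

    # Retrieval step only: src_emb (n_src, d) and tgt_emb (n_tgt, d) are monolingual
    # word embeddings, and W (d, d) is an assumed, already-learned source-to-target map.
    import numpy as np

    def induce_dictionary(src_emb, tgt_emb, src_words, tgt_words, W, top_k=1):
        def normalize(M):
            return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-9)

        mapped = normalize(src_emb @ W)       # project source words into target space
        sims = mapped @ normalize(tgt_emb).T  # cosine similarities, (n_src, n_tgt)

        dictionary = {}
        for i, word in enumerate(src_words):
            best = np.argsort(-sims[i])[:top_k]
            dictionary[word] = [tgt_words[j] for j in best]
        return dictionary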

Published in

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Copyright © 2020 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States

