ABSTRACT
In an era of unstructured data abundance, one might think that our data requirements for building robust language processing systems have been met. This is not the case, however, on a global scale: of the world's more than 7,000 languages, only a handful have digital resources. High-performing systems at scale typically require annotated resources that span genre and domain divides. Moreover, the concentration of resources in a handful of languages reflects the digital disparity across societies, leading to inadvertent biases in systems. In this talk I will present solutions for low-resource scenarios, both across domains and genres and across languages.
I will address data paucity from the angle of devising principled metrics for data selection. Summarizing data samples with quantitative measures has a long history, descriptive statistics being a case in point. As natural language processing methods flourish, however, we still lack characteristic metrics that describe a collection of texts in terms of the words, sentences, or paragraphs it comprises. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds the desired properties and resonates with human intuition. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with the text classification performance of a renowned model, BERT, which could inspire future applications. We look specifically at the problems of intent classification (IC) and sentiment analysis.
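To make the dispersion/sparsity/uniformity intuition concrete, here is a minimal sketch of such metrics computed over text embeddings. The exact formulas from the talk are not given in this abstract, so the definitions below are illustrative stand-ins: diversity as mean pairwise distance, density as samples per unit radius, and homogeneity as the inverse coefficient of variation of pairwise distances.

```python
import numpy as np

def pairwise_distances(embeddings: np.ndarray) -> np.ndarray:
    """Upper-triangle pairwise Euclidean distances between embeddings."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    return d[np.triu_indices(len(embeddings), k=1)]

def diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise distance: how dispersed the collection is."""
    return float(pairwise_distances(embeddings).mean())

def density(embeddings: np.ndarray) -> float:
    """Samples per unit radius of the bounding sphere (sparsity proxy)."""
    center = embeddings.mean(axis=0)
    radius = np.linalg.norm(embeddings - center, axis=1).max()
    return float(len(embeddings) / (radius + 1e-12))

def homogeneity(embeddings: np.ndarray) -> float:
    """1 / (1 + coefficient of variation of pairwise distances)."""
    d = pairwise_distances(embeddings)
    return float(1.0 / (1.0 + d.std() / (d.mean() + 1e-12)))

# Sanity check on synthetic data: a tight cluster vs. a widely spread one.
rng = np.random.default_rng(0)
tight = rng.normal(0, 0.1, size=(50, 16))
spread = rng.normal(0, 2.0, size=(50, 16))
assert diversity(spread) > diversity(tight)
assert density(tight) > density(spread)
```

In practice the embeddings would come from a trained encoder (e.g. BERT sentence representations), and the metrics would be compared across candidate training sets before model selection.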
On the modeling side, for low-resource genres and domains, we investigate few-shot learning techniques for intent classification (IC) and for the sequence-labeling models used in slot filling (SF), both core components of dialogue systems for task-oriented chatbots. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra-low-resource scenarios, and we establish a few-shot IC/SF benchmark. We show that two popular few-shot learning algorithms, model-agnostic meta-learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark.
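The core of a prototypical network is simple enough to sketch: average the support embeddings of each class into a prototype, then assign queries to the nearest prototype. The sketch below assumes utterances are already encoded as vectors; in the actual systems the encoder would be a trained neural model, and the intent labels here are hypothetical.

```python
import numpy as np

def prototypes(support_x: np.ndarray, support_y: np.ndarray) -> dict:
    """Mean embedding per class over the support set."""
    return {c: support_x[support_y == c].mean(axis=0)
            for c in np.unique(support_y)}

def classify(query_x: np.ndarray, protos: dict) -> np.ndarray:
    """Assign each query to the class of its nearest prototype."""
    classes = sorted(protos)
    dists = np.stack([np.linalg.norm(query_x - protos[c], axis=1)
                      for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

# Toy 2-way, 3-shot episode: two well-separated clusters of "utterance"
# embeddings standing in for two unseen intent classes.
rng = np.random.default_rng(1)
support_x = np.vstack([rng.normal(0, 0.1, (3, 8)),
                       rng.normal(3, 0.1, (3, 8))])
support_y = np.array([0, 0, 0, 1, 1, 1])
query_x = np.vstack([rng.normal(0, 0.1, (2, 8)),
                     rng.normal(3, 0.1, (2, 8))])
preds = classify(query_x, prototypes(support_x, support_y))
assert list(preds) == [0, 0, 1, 1]
```

Because the prototypes are recomputed per episode, the same encoder can classify into classes it never saw during training, which is exactly the ultra-low-resource setting the benchmark targets.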
From a multilingual perspective, we bootstrap cross-lingual systems to induce word- and sentence-level representations. Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach that learns a bilingual dictionary for a pair of languages given their independently learned monolingual word embeddings. The proposed method exploits local and global structure in the monolingual vector spaces to align them so that similar words are mapped to each other. Finally, I will show how we use annotation projection for cross-lingual emotion detection and semantic role labeling. We leverage a multitask learning framework coupled with an annotation projection method that transfers labels from a rich-resource language to a low-resource language through parallel data, and we train predictive models on the projected data.
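A common building block in embedding-space alignment, including unsupervised pipelines where translation pairs are themselves induced from structure, is the orthogonal (Procrustes) mapping step. The sketch below is not the talk's full unsupervised method; it only shows how, given some hypothesized word pairs, one recovers the rotation that best maps one monolingual space onto the other.

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||XW - Y||_F (closed-form via SVD).

    Rows of X and Y are embeddings of hypothesized translation pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: if the "target" space is an exact rotation of the "source"
# space, the Procrustes solution recovers that rotation.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))          # source-language embeddings
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # hidden true rotation
Y = X @ Q                               # target-language embeddings
W = procrustes(X, Y)
assert np.allclose(X @ W, Y, atol=1e-8)
```

In an unsupervised setting this step is typically iterated: induce candidate pairs from nearest neighbors across the mapped spaces, refit W, and repeat until the dictionary stabilizes.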
Index Terms
- Data Paucity and Low Resource Scenarios: Challenges and Opportunities