Effects and Mitigation of Out-of-vocabulary in Universal Language Models

Sangwhan Moon; Naoaki Okazaki

doi:10.2197/ipsjjip.29.490

Abstract

One of the most important recent trends in natural language processing (NLP) is transfer learning: using representations from neural language models to perform other tasks. While transfer learning is a promising and robust method, downstream task performance depends on the robustness of the backbone model's vocabulary, which in turn reflects both the positive and negative characteristics of the corpus used to train it. With subword tokenization, the out-of-vocabulary (OOV) problem is generally assumed to be solved. In languages with large alphabets such as Chinese, Japanese, and Korean (CJK), however, this assumption does not hold. In this work, we demonstrate the adverse effects of OOV on transfer learning in CJK languages, then propose a novel approach to maximize the utility of a pre-trained model suffering from OOV. We further investigate the correlation between OOV and task performance, and explore whether and how mitigation can salvage a model with a high OOV rate.
