SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Song, Haiyue; Dabre, Raj; Chu, Chenhui; Kurohashi, Sadao; Sumita, Eiichiro

doi:10.1145/3610611

Abstract:Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE), however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability and generates the segmentation with the maximum posterior probability, which is calculated using a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle- and high-resource scenarios, where we compare the performance of using different segmentation methods. The experimental results demonstrate that on the low-resource ALT dataset, our method achieves more than 1.2 BLEU score improvement compared with BPE and SentencePiece, and a 1.1 score improvement over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) on average. The regularization method achieves approximately a 4.3 BLEU score improvement over BPE and a 1.2 BLEU score improvement over BPE-dropout, the regularized version of BPE. We also observed significant improvements on IWSLT15 Vi->En, WMT16 Ro->En and WMT15 Fi->En datasets, and competitive results on the WMT14 De->En and WMT14 Fr->En datasets.

Comments:	Accepted to TALLIP journal
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2307.16400 [cs.CL]
	(or arXiv:2307.16400v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2307.16400
Related DOI:	https://doi.org/10.1145/3610611

Computer Science > Computation and Language

Title:SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators