An Empirical Study of Korean Sentence Representation with Various Tokenizations

Abstract: How the token unit of a sentence is defined is important in natural language processing tasks, such as text classification, machine translation, and generation. Many recent studies have utilized subword tokenization in language models such as BERT and KoBERT.


Introduction
Embedding is a fundamental step when representing text data in vector space for natural language processing tasks, such as text classification, machine translation, and generation. To represent discrete data such as words and sentences in vector space, much research has been conducted on embedding methods based on words, morphemes, and subwords. Most embedding research has focused on word embedding, which learns a vector for each word token [1][2][3][4]. However, word embedding is hampered by unknown tokens, which raise the out-of-vocabulary (OOV) problem. To alleviate the OOV problem, morpheme embedding was introduced [5,6]. Since a word is decomposed into a sequence of morpheme tokens based on morphological meaning, morpheme embedding has fewer unknown tokens and is more robust to the OOV problem than word embedding [5,7]. Subword tokenization is an attractive approach in sentence embedding research [8,9], and it achieved good performance on a political-bias classification task [10]. Subword tokens are produced by data-driven statistical algorithms, for example, byte pair encoding (BPE) and SentencePiece (https://github.com/google/sentencepiece) (accessed on 9 December 2020) [11,12]. Subword embedding has proven effective for machine translation tasks [8,9], and transformer-based language models, namely, BERT and KoBERT, achieved state-of-the-art results on sentiment analysis using subword tokenization [13,14]. However, it is not clear whether subword tokenization is an effective method for Korean sentence embedding. We raise the question: "What is the optimal token unit for Korean sentence embedding?" To find the answer, we explored a sentiment analysis task with sentence representation methods, known as sentence embedding, based on various token units (word, morpheme, subword, and submorpheme).
Transformer-based language models (e.g., BERT, KoBERT, and ALBERT) are pretrained on large corpora with subword tokenization to improve performance [13][14][15]. However, because we focused on determining whether subword tokenization itself is an important factor in performance, we controlled the other performance factors and varied only the tokenization used for sentence embedding. To this end, we used simple classifiers: a support vector machine (SVM) (https://www.cs.cornell.edu/people/tj/svm_light/) (accessed on 9 December 2020), a multi-layer perceptron (MLP), and long short-term memory (LSTM). That is, we constructed sentence embeddings by applying token sequences based on the various token units to the sentence representation methods and evaluated the performance of the sentence embeddings with the simple classifiers on a Korean sentiment analysis task. Our paper is structured as follows. In Section 2, we introduce related work on sentence representation methods based on tokenization. In Section 3, we describe the properties of token units in the Korean language and how to obtain the sentence vector from the sequence of token units in a sentence. In Section 4, we evaluate the performance of sentence embedding according to the token unit defining the sentence on the Korean sentiment analysis task. In Section 5, we analyze the experimental results, considering the properties of the token units. Lastly, Section 6 summarizes our work and presents its expected impact.

Related Work
Most previous studies input the word token unit into embedding models, namely, Word2vec and GloVe [1][2][3][16][17][18], to represent the word as a continuous vector. Tang (2014) and Severyn (2015) conducted word embedding by inputting word tokens into skip-gram for sentence classification tasks in English [1,2]. Zhao (2018) also explored word embedding in English utilizing GloVe to eliminate gender stereotypes caused by biased data [4]. Lee (2019) applied CBOW word embedding to a Korean spam message filtering task [3]. The morpheme has the advantage of expressing the internal meaning of a word because a word is decomposed into a sequence of morpheme tokens by morphological meaning [5][6][7]. In English, Botha (2014) verified that morpheme embedding represents the meaning of tokens more effectively than word embedding, and Tang (2020) introduced tokenization strategies for resolving the OOV problem based on morpheme embedding [5,7]. The authors of [6] combined morphological features in a word to demonstrate the effect of morpheme embedding in the Korean language. Subword tokenization was recently introduced with the BPE and SentencePiece algorithms [8][9][10]. Banerjee (2018) and Wang (2020) tackled the OOV problem by utilizing the BPE algorithm on machine translation tasks with multiple languages [8,9]. Cho (2020a) explored subword embedding with SentencePiece for a bias classification task using a Korean news article dataset [10]. In addition, BERT was introduced by Devlin (2019) as a pretrained language model based on the transformer, and it was extended to KoBERT and ALBERT [13][14][15]. These models utilize subword embedding to pretrain on large corpora and achieved state-of-the-art results on various NLP tasks.

Tokenization
Tokenization converts a sentence into a token sequence. As shown in Table 1, tokenization results differ depending on the token unit. Thus, we describe the tokenization results along with properties of the Korean language in the order of word, morpheme, subword, and submorpheme. In all tokenizations, we did not handle punctuation marks. Table 1. Example of tokenization results according to token units for "너무나 감동적인 영화" (very touching movie); tokens that cannot be represented in English are marked "-" (hyphen).
The morpheme is the smallest semantic token unit and has the advantages of reducing the complexity of morphological meaning and decreasing the number of unknown tokens. Morpheme tokenization represents the sentence of Table 1 as "너무나" (very), "감동" (touch), "적" (-), "인" (ing), and "영화" (movie). As this result shows, morpheme tokenization decomposes the sentence more finely than word tokenization. For example, "감동적인" (touching) in word tokenization is decomposed into "감동" (touch), "적" (-), and "인" (ing) in morpheme tokenization. In other words, because morpheme tokenization alleviates the complexity of morphological meaning in a word and decreases the number of unknown tokens, it is more effective at resolving the OOV problem than word tokenization.
In our work, subwords are produced by SentencePiece, a data-driven statistical algorithm based on a language model [11]. SentencePiece replaces the whitespace in a sentence with a particular symbol, "_" (underbar), so that the sentence can be restored for machine translation tasks [19], and decomposes the sentence into a subword sequence. For example, "너무나 감동적인 영화" is decomposed into "_너무" (_very), "나" (-), "_감동" (_touch), "적인" (-ing), and "_영화" (_movie) in the subword tokenization result of Table 1. However, we found that "_영화" (_movie) and "영화" (movie) have different vectors, even though the two tokens share the same meaning, "movie". We hypothesize that replacing the whitespace is unnecessary in the sentiment analysis task because the replacement exists to restore sentences in machine translation. To verify this hypothesis, we carried out the subword experiments in two cases: SWU (SubWord with Underbar) and SWT (SubWord Token without underbar). SWU is the original output of SentencePiece, so SWU tokens include the particular symbol "_" (underbar) replacing the whitespace. In contrast, SWT removes the symbol from SWU; that is, SWT removes the symbol after applying SentencePiece to a sentence. For example, "_영화" (_movie) in SWU is transformed into "영화" (movie) in SWT.
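A minimal sketch of the SWT transformation; the token strings are copied from the Table 1 example rather than produced by a trained SentencePiece model, and the whitespace symbol is written as "_" as in the paper (SentencePiece itself uses "▁", U+2581):

```python
def swu_to_swt(swu_tokens):
    """Derive SWT from SWU by stripping the leading whitespace symbol
    from each subword token; tokens that consist only of the symbol
    are dropped."""
    return [t.lstrip("_") for t in swu_tokens if t.lstrip("_")]

swu = ["_너무", "나", "_감동", "적인", "_영화"]
print(swu_to_swt(swu))  # ['너무', '나', '감동', '적인', '영화']
```

Note that "_영화" and "영화" collapse to the same token under SWT, which is exactly the duplication the SWU/SWT comparison probes.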
A submorpheme is created by applying the morpheme sequence to SentencePiece after tokenizing a sentence with a morphological analyzer. We expected that applying a morpheme sequence to SentencePiece would improve the performance of subword-based sentence embedding, because morpheme-based sentence embedding has been shown to outperform word-based sentence embedding. Thus, we applied a morpheme sequence to SentencePiece instead of a word sequence, which we call a submorpheme. For example, the submorpheme tokenization results are "_너무" (_very), "나" (-), "_감동" (_touch), "_적" (_-), "_인" (_ing), and "_영화" (_movie), based on the morpheme tokenization "너무나" (very), "감동" (touch), "적" (-), "인" (ing), and "영화" (movie). As with subword tokenization, we explored submorpheme tokenization in two cases: SMU (SubMorpheme with Underbar) and SMT (SubMorpheme Token without underbar). That is, SMU is the result of applying a morpheme sequence to SentencePiece, whereas SMT removes the symbol from SMU.
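The submorpheme pipeline can be sketched as re-spacing the sentence at morpheme boundaries before SentencePiece sees it; the `analyze` function below is a hypothetical stand-in for the morphological analyzer (Mecab, Okt, or KLT2000), hard-coded to the Table 1 example:

```python
def to_submorpheme_input(sentence, analyze):
    """Re-space a sentence at morpheme boundaries so that SentencePiece
    tokenizes the morpheme sequence instead of the word sequence."""
    return " ".join(analyze(sentence))

# Hypothetical analyzer output for the running example of Table 1
fake_analyze = lambda s: ["너무나", "감동", "적", "인", "영화"]
print(to_submorpheme_input("너무나 감동적인 영화", fake_analyze))
# 너무나 감동 적 인 영화
```

Because every morpheme now starts a "word", SentencePiece marks each one with its own underbar, which is why the SMU result in the text shows "_적" and "_인" where SWU had "적인".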

Sentence Representation Methods
There are two methods for representing a sentence vector. The first averages the sequential token vectors of a sentence, composing the sentence vector without considering the order of tokens. When a sentence consists of a token sequence S = (t_1, t_2, ..., t_n), where t_i is the i-th token in the sentence, we compose the sentence vector V_S as average(v_{t_1}, v_{t_2}, ..., v_{t_n}), where v_{t_i} is the vector of the i-th token t_i. The second considers the order of tokens in a sentence by sequentially inputting the token vectors of the sentence into an LSTM. We carried out both methods to compare composing a sentence vector with and without considering token order. Before representing the sentence vector, we pretrained continuous vectors on the downstream task dataset using skip-gram (https://code.google.com/archive/p/word2vec/) (accessed on 9 December 2020) [16,17]. We set the hyperparameters as follows: 300 iterations, min-count 1, window size 5, and vector sizes 200, 250, and 300, respectively. When pretraining on the downstream task dataset, we used only the trainset.
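The averaging method can be sketched in a few lines; the toy 2-dimensional vectors stand in for the 200- to 300-dimensional skip-gram vectors used in the experiments:

```python
def sentence_vector(token_vectors):
    """V_S = average(v_t1, ..., v_tn): element-wise mean of the
    pretrained token vectors, ignoring token order."""
    n, dim = len(token_vectors), len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy example: three token vectors of dimension 2
print(sentence_vector([[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]))  # [2.0, 2.0]
```

Since the mean is invariant to permutation, any reordering of the same tokens yields the same sentence vector, which is precisely what the order-aware LSTM method is designed to avoid.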
In the sentence representation method that ignores the order of tokens in a sentence, we averaged the token vectors, which map the tokens of the input sentence to pretrained vectors, to compose the sentence vector. The sentence vector obtained by averaging the sequential token vectors was input into two classifiers: SVM and MLP. To validate the performance of the classifiers, we split the trainset in a ratio of 8:2 in all experiments. SVM aims to maximize the distance between the decision boundary and the support vectors [20]. In the SVM experiment, we used the hinge loss of Equation (1), setting ∆ to 1, to minimize the loss between the actual score y and the predicted score ŷ.
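For binary labels y ∈ {−1, +1}, margin ∆ = 1, and predicted score ŷ, the hinge loss referenced as Equation (1) presumably takes the standard form:

```latex
L(y, \hat{y}) = \max\!\left(0,\; \Delta - y \cdot \hat{y}\right), \qquad \Delta = 1
```

The loss is zero whenever the prediction has the correct sign and clears the margin ∆, and grows linearly otherwise.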
We implemented the MLP as a simple network with hidden units equal to half the vector size and set the hyperparameters as follows: ReLU as the activation function in the hidden layer; stochastic gradient descent (SGD) as the optimizer with a learning rate of 0.001; batch size 32; and 100 epochs. We minimized the loss using the cross-entropy function of Equation (2). In the output layer, we used the softmax function to output the predicted probability for the input vector.
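A minimal forward-pass sketch of this architecture (one hidden layer of half the input size with ReLU, softmax output); the toy dimensions and random weights are illustrative only, and SGD training against the cross-entropy loss is omitted:

```python
import math
import random

def mlp_forward(x, W1, b1, W2, b2):
    """One forward pass: hidden layer with ReLU, then softmax output."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W2, b2)]
    m = max(z)                            # shift for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]             # predicted class probabilities

random.seed(0)
dim, hidden, classes = 4, 2, 2            # toy sizes; the paper uses 200-300 dims
W1 = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
b1 = [0.0] * hidden
W2 = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(classes)]
b2 = [0.0] * classes
probs = mlp_forward([0.1, -0.2, 0.3, 0.4], W1, b1, W2, b2)
```

The softmax guarantees that `probs` sums to one, so it can be consumed directly by the cross-entropy loss of Equation (2).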

Cross Entropy Loss
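Equation (2), referenced above, is presumably the standard cross-entropy over the C classes, with one-hot label y and predicted softmax probability ŷ:

```latex
L = -\sum_{c=1}^{C} y_c \log \hat{y}_c
```

With binary sentiments (C = 2), this reduces to the familiar binary cross-entropy on the positive-class probability.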
LSTM is effective at processing sequential data, for example, sensor data for detecting abnormal kick patterns in Taekwondo matches [21]. This is because the weights of an LSTM at time t are trained using the histories of previous times {(t − 1), (t − 2), ...} of the sequential data [22,23]. To consider the order of tokens in a sentence, which is likewise sequential data, we construct the sentence embedding by sequentially training on the histories of previous times in the LSTM. We sequentially input the token vectors, which map the tokens of the sentence to pretrained vectors, into the LSTM. The LSTM has 128 hidden units and uses tanh as the activation function. We output the hidden-unit vectors at every time t and concatenated them. The loss of the LSTM is also minimized with the cross-entropy function of Equation (2), as for the MLP. We split the trainset in a ratio of 8:2 to validate the performance of the model and set the hyperparameters of the LSTM as SGD with a learning rate of 0.001, batch size 32, and 100 epochs. The LSTM outputs the probability for the input sentence with a softmax function in the output layer.
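To make the recurrence concrete, here is a toy single-unit LSTM step in pure Python; the actual model uses 128 hidden units and learned weight matrices, so the scalar constant weights here are illustrative only:

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step on scalar inputs (toy, single hidden unit).
    The hidden state h depends on h_prev and c_prev, so every step
    carries the history of all previous times."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sig(W["wi"] * x + W["ui"] * h_prev + W["bi"])      # input gate
    f = sig(W["wf"] * x + W["uf"] * h_prev + W["bf"])      # forget gate
    o = sig(W["wo"] * x + W["uo"] * h_prev + W["bo"])      # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h_prev + W["bg"])  # candidate
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

W = {k: 0.5 for k in ["wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = 0.0, 0.0
hs = []
for x in [1.0, -0.5, 0.3]:   # token vectors fed in sentence order
    h, c = lstm_step(x, h, c, W)
    hs.append(h)             # hidden state collected at every time t
# hs plays the role of the concatenated per-timestep hidden vectors
```

Reordering the input sequence changes `hs`, which is exactly why this representation, unlike averaging, is sensitive to token order.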

Experiments
We empirically investigate the effective token unit for Korean sentence embedding here. In detail, we focus on the following three research questions:
1. Which tokenization is more robust to the OOV problem?
2. Is the symbol replacing whitespaces meaningful in Korean sentiment analysis?
3. What is the optimal vocabulary size for sentence embedding?

Dataset and Token Analysis
We were inspired by Cho (2020a), which carried out political-bias classification on a news article dataset, to find the optimal tokenization for Korean sentences [10]. In Korean, informal text includes many token variations, unlike formal text such as news articles. For example, "강력한 추천" (strong recommendation) is abbreviated as "강추". Additionally, "대박" (wow) is a coined word in Korean, and "멋있다" (nice) is often mistyped as "머싰다". Because these properties of informal text are heavily influenced by the token unit that makes up a sentence, we used an informal text corpus, the Naver sentiment movie corpus (NSMC) (https://github.com/e9t/nsmc) (accessed on 2 December 2020), for the sentiment analysis task in Korean. The NSMC dataset has a trainset (150,000 reviews) and a testset (50,000 reviews) with binary sentiments, positive and negative [24,25]. A review in the NSMC dataset consists of no more than 140 characters, so we regarded a review as a sentence in our work. In other words, we carried out sentiment analysis using the NSMC dataset to evaluate sentence embedding based on various token units (word, morpheme, subword, and submorpheme).
We used SentencePiece for subword tokenization and hypothesized that replacing the whitespace is unnecessary for the Korean sentiment analysis task. For subword tokenization, we carried out the SWU and SWT methods depending on whether the whitespace is replaced with the particular symbol "_" (underbar). The SentencePiece algorithm creates the vocabulary according to the configured vocabulary size. Cho (2020c) tested SentencePiece with vocabulary sizes 50K, 75K, 100K, and 125K, and the results showed a performance improvement with the smallest vocabulary size, 50K [27]. To test whether performance keeps improving with vocabulary sizes smaller than 50K, we explored vocabulary sizes 2K, 4K, 8K, 16K, and 32K. For submorpheme tokenization, we utilized the same morphological analyzers as morpheme tokenization (Mecab, Okt, and KLT2000) together with SentencePiece. As we expected submorpheme tokenization to follow the same trend as subword tokenization, we explored sentence embedding with submorpheme tokenization in the two cases SMU and SMT. As for subwords, we explored vocabulary sizes 2K, 4K, 8K, 16K, and 32K in the submorpheme experiments with Okt and KLT2000. In submorpheme tokenization with Mecab, we tested vocabulary sizes 2K, 4K, 8K, and 16K because vocabulary size 32K does not work when a morpheme sequence produced by Mecab is applied to SentencePiece.
We examined the data distribution of the NSMC dataset for each token unit (e.g., the number of tokens, OOV rate, and average token length). In Table 2, N(token) refers to the number of tokens in the trainset or testset of the NSMC dataset. The OOV rate is the ratio of the number of unknown tokens to the number of testset tokens, as in Equation (3). Avg_length is the average length of tokens in the trainset or testset of the NSMC dataset.
The average token length is the sum of the lengths of all tokens over the total number of tokens in the dataset, as in Equation (4), where n is the total number of tokens in the dataset and t_i is the i-th token. For example, "너무나/감동적인/영화" under word tokenization has an average token length of (3 + 4 + 2) / 3 = 3 because the token lengths are 3 (너무나), 4 (감동적인), and 2 (영화), and n is 3. Table 2 shows that the larger the number of tokens in the trainset and testset, the lower the OOV rate. Besides, in the subword and submorpheme tokenizations, we found only minor differences between the data distributions of SWU and SWT (or SMU and SMT), even across vocabulary sizes. In Table 2, word tokenization has the smallest number of tokens and a higher OOV rate, 26.48%, compared to the other token units. This means that word tokenization is not robust to the OOV problem caused by unknown tokens. Among the morpheme tokenizations, KLT2000 showed the smallest number of tokens and the highest OOV rate, 6.53%, whereas Mecab and Okt had larger numbers of tokens and smaller OOV rates of 1.032% and 2.56%, respectively, compared to KLT2000. In the subword tokenizations, SWU and SWT show similar data distributions, with SWT showing a slightly lower OOV rate than SWU. We expect that SWT eliminates some noise from tokens duplicated by the symbol, such as "_영화" (_movie) and "영화" (movie). We also found that both SWU and SWT have large numbers of tokens and lower OOV rates with small vocabularies, but the difference between them is trivial. Submorpheme tokenization follows a trend similar to subword tokenization. Table 2 shows a representative data distribution of submorphemes with Mecab. The specific data distributions of the submorpheme unit, including Okt and KLT2000, are presented in Table A1.
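Equations (3) and (4) can be sketched directly; the token lists below are the Table 1 word-tokenization example, not the full NSMC data:

```python
def oov_rate(train_tokens, test_tokens):
    """Equation (3): percentage of testset tokens absent from the
    trainset vocabulary."""
    vocab = set(train_tokens)
    unknown = sum(1 for t in test_tokens if t not in vocab)
    return 100.0 * unknown / len(test_tokens)

def avg_token_length(tokens):
    """Equation (4): sum of token lengths over the number of tokens."""
    return sum(len(t) for t in tokens) / len(tokens)

print(avg_token_length(["너무나", "감동적인", "영화"]))  # (3+4+2)/3 = 3.0
```

Finer-grained tokenizations shrink both quantities at once: shorter tokens recur more often across the trainset and testset, so fewer testset tokens are unknown.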

Experiments and Results
To evaluate sentence embedding based on the various token units, we carried out the Korean sentiment analysis task using SVM, MLP, and LSTM as classifiers. Table 3 shows the sentiment analysis accuracy of sentence embedding based on the word and morpheme units. Tables 4 and 5 show the sentiment analysis accuracy of sentence embedding based on the subword and submorpheme units, respectively. Tables 3-5 report performance according to vector sizes 200, 250, and 300. Overall, the accuracy of the LSTM was better than those of the SVM and MLP. This means that the sentence representation method considering the order of tokens in the sentence is more effective than the one ignoring it. Thus, we analyzed the experimental results of sentence embedding based on the LSTM performances.
As shown in Table 3, morpheme-based sentence embedding outperforms word-based sentence embedding. Word-based sentence embedding achieved an accuracy of 81.27%, significantly less than morpheme-based sentence embedding. Among the morphological analyzers for morpheme-based sentence embedding, Mecab achieved the best accuracy, 85.39%. Table 4 shows the performance of sentiment analysis using subword-based sentence embedding according to vocabulary size, comparing SWU and SWT. As shown in Table 4, we found two key points. First, SWT outperformed SWU among the subword-based sentence embedding methods. Second, performance improved when the vocabulary size was large. Sentence embedding based on SWT achieved 85.67% accuracy, whereas sentence embedding based on SWU achieved 85.42%. The difference between SWU and SWT was trivial, but SWT showed higher accuracy than SWU. In the comparison of vocabulary sizes, sentence embedding based on SWU and SWT achieved accuracies of 81.57% and 81.59% for vocabulary size 2K, respectively, whereas for vocabulary size 32K, they achieved 85.34% and 85.52%. Although sentence embedding with SWU and SWT achieved their best accuracies of 85.42% and 85.67% at vocabulary size 16K, respectively, Table 4 shows a tendency for larger vocabulary sizes to improve performance. We examined the performance of sentence embedding based on submorpheme tokenization in Table 5. First of all, contrary to our expectation that submorpheme-based sentence embedding would outperform subword-based sentence embedding, it performed worse.
Submorpheme-based sentence embedding with Okt_SMU achieved 85.32% accuracy as its best performance, whereas subword-based sentence embedding achieved a better performance of 85.67% with SWT. We further found three tendencies. First, among the morphological analyzers utilized in submorpheme tokenization, Okt showed better performance than the others. Second, SMU outperformed SMT among the submorpheme-based sentence embedding methods, unlike the experimental results of subword-based sentence embedding. Lastly, performance improved when the vocabulary size was large, similarly to subword-based sentence embedding. In Table 5, sentence embedding based on SMU showed performances of 84.83%, 85.32%, and 84.82% with Mecab, Okt, and KLT2000, respectively, whereas sentence embedding based on SMT showed 84.74%, 85.16%, and 84.83%. The performance of submorpheme-based sentence embedding improved at vocabulary size 32K compared to 2K, just like subword-based sentence embedding. As shown in Tables 3-5, the best tokenization method for Korean sentence embedding is SWT subword tokenization with vocabulary size 16K, at 85.67% accuracy. Multilingual BERT and KoBERT achieved 87.5% and 90.1% accuracy, respectively, on sentiment analysis with the NSMC dataset. However, our method remains competitive because multilingual BERT and KoBERT require substantial computation for pretraining on large corpora, whereas our method requires little computation with simple classifiers. Our research is further valuable in suggesting the most efficient token units for sentence embedding.

Analysis and Discussion
Based on the data distributions and the performance of sentence embedding on the Korean sentiment analysis task, we can now answer the three research questions.

1. Which tokenization is more robust to the OOV problem?
2. Is the symbol replacing whitespaces meaningful in Korean sentiment analysis?
3. What is the optimal vocabulary size for sentence embedding?
It is known that word embedding causes the OOV problem and that morpheme embedding is effective at alleviating it. Considering these facts, we compared the performance of sentence embedding based on word, morpheme, subword, and submorpheme tokenization in relation to the OOV rate. As shown in Table 2, the OOV rate was lowest in subword tokenization, from 0.028% to 0.202%, followed by submorpheme tokenization, from 0.029% to 0.084%, morpheme tokenization, from 1.032% to 6.53%, and word tokenization (the highest) at 26.48%. Subword-based sentence embedding outperformed sentence embedding based on the other tokenizations, as shown in Tables 3 and 4. Although submorpheme tokenization had a low OOV rate like subword tokenization, the performance of submorpheme-based sentence embedding (85.32% accuracy) was lower than that of subword-based sentence embedding (85.67% accuracy). We suspect that submorpheme tokenization loses the syntactic and semantic meaning of a token because the combination of the data-driven algorithm and the morphological analyzer decomposes it too finely. From the result that performance generally improves when the OOV rate is low, we obtained the answer to "Which tokenization is more robust to the OOV problem?": subword tokenization is more robust to the OOV problem than the other tokenizations.
In the subword and submorpheme tokenizations, we measured the ratio of tokens duplicated by the symbol "_" (underbar) in SWU to analyze the effect of replacing the whitespace with a particular symbol. The duplicated token rate was calculated with Equation (5). We obtained the duplicated token rates in the trainset for each vocabulary size. The results show 12% (±4.755), the mean ± standard deviation of the duplicated token rate over the vocabulary sizes. This means that SWU learns 12% (±4.755) noise in the trainset, which supports the result that SWT outperforms SWU among the subword-based sentence embedding methods. Thus, we conclude that the replacement with the symbol is not effective for Korean sentence embedding. In the submorpheme tokenizations, however, the results differ from subword tokenization: SMU performed better than SMT among the submorpheme-based sentence embedding methods. We suspect this is because submorpheme tokenization performs additional tokenization with SentencePiece on morpheme tokens that have already been decomposed by the morphological analyzer, so the symbol in submorpheme tokenization no longer plays the role of marking whitespace.
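Equation (5) is not reproduced here; one plausible reading of the duplicated token rate, the share of vocabulary entries whose underbar-stripped form also appears in the vocabulary, can be sketched as follows (the function name and the tiny vocabulary are illustrative, not the paper's actual computation):

```python
def duplicated_token_rate(vocab):
    """Hypothetical reading of Equation (5): percentage of '_'-prefixed
    vocabulary tokens whose stripped form is also in the vocabulary,
    e.g. '_영화' (_movie) duplicating '영화' (movie)."""
    vocab = set(vocab)
    dup = sum(1 for t in vocab
              if t.startswith("_") and t.lstrip("_") in vocab)
    return 100.0 * dup / len(vocab)

print(duplicated_token_rate(["_영화", "영화", "_너무"]))
```

In this toy vocabulary only "_영화" has a duplicate, so one of three entries counts, about 33%.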
Additionally, by extending the experiment of Cho (2020c), we obtained the answer to "What is the optimal vocabulary size for Korean sentence embedding?". Cho (2020c) empirically concluded that performance improves as the vocabulary size decreases, experimenting with vocabulary sizes 50K, 75K, 100K, and 125K [27]. To verify whether a vocabulary size smaller than 50K yields better performance, we tested subword and submorpheme tokenization with vocabulary sizes 2K, 4K, 8K, 16K, and 32K. The experimental results in Tables 4 and 5 showed a tendency for performance to improve when the vocabulary size was large, contrary to the conclusion of Cho (2020c). To analyze these results, we compared the average token lengths of our experiments (vocabulary sizes 2K, 4K, 8K, 16K, and 32K) with those of Cho (2020c) (vocabulary sizes 50K, 75K, 100K, and 125K). As shown in Table 2, the average token length increases as the vocabulary size increases in both subword and submorpheme tokenizations. Vocabulary sizes 50K, 75K, 100K, and 125K, tested in Cho (2020c), yielded average token lengths of 2.769, 2.889, 2.962, and 3.016 for the SWU trainset, respectively, and 2.193, 2.288, 2.346, and 2.389 for the SWT trainset. In our work, subword-based sentence embedding with both SWT and SWU achieved the best accuracy at vocabulary size 16K, where the average token length was close to about 2 or 2.5, as it also was for submorpheme-based sentence embedding. Analyzing the results together with the average token lengths, we reached a new conclusion: the optimal vocabulary size is 16K because its average token length is close to about 2 or 2.5.

Conclusions
To find the optimal token unit for constructing the Korean sentence vector, we carried out the Korean sentiment analysis task using sentence embedding with various token units: word, morpheme, subword, and submorpheme. When representing the sentence vector from the token unit sequence, we used two sentence representation methods, one considering the order of tokens in the sentence and one not. We empirically found that when the SWT variant of subword tokenization is used with the order-aware sentence representation method, performance is best, at 85.67% accuracy. We investigated the properties of the token units to analyze our experimental results and found the following key points. (1) Subword tokenization is more robust to the OOV problem than morpheme tokenization because of its lower OOV rate. (2) Replacing the whitespace with a particular symbol is not effective in subword and submorpheme tokenization for the Korean sentiment analysis task. (3) In the tokenizations utilizing the SentencePiece algorithm, such as subword and submorpheme tokenization, vocabulary size 16K achieved the best performance because its average token length was close to about 2 or 2.5. We expect that our research will lay a foundation for research on effective sentence embedding with simple computation.