Utilizing Indonesian Universal Language Model Fine-tuning for Text Classification

Inductive transfer learning has made a huge impact on the computer vision field. In particular, computer vision applications, including object recognition, segmentation, and classification, are seldom trained from scratch; instead, they are fine-tuned from pretrained models, which are products of learning from huge datasets. In contrast, state-of-the-art natural language processing models are still generally trained from the ground up. Accordingly, this research investigates the adoption of transfer learning for natural language processing. Specifically, we apply a transfer learning technique named Universal Language Model Fine-tuning (ULMFiT) to an Indonesian news text classification task. The dataset for constructing the language model was collected from several news providers from January to December 2017, whereas the dataset for the text classification task comes from news articles provided by the Agency for the Assessment and Application of Technology (BPPT). To examine the impact of ULMFiT, we provide a baseline, a vanilla neural network with two hidden layers. Although the performance of ULMFiT on the validation set is lower than that of our baseline, we find that ULMFiT significantly reduces overfitting: the difference between training and validation accuracies falls from 4% to nearly zero.


Introduction
Text classification is defined as the problem of building models that assign documents to pre-defined labels [27,26,31]. Researchers continue to develop new text classification techniques that aim for better computational efficiency and prediction accuracy than before [45,1,36,31,2].
State-of-the-art text classification techniques are largely based on deep learning. Basically, neural networks with multiple hidden layers are considered deep learning, as depicted in Fig. 1. What is special about deep learning is its capability to learn non-linear relationships from data [20]. The success stories of deep learning began in the computer vision field, specifically in the image classification competition known as ImageNet [14]. Deep learning has also been successful in the natural language processing (NLP) field, notably in automatic translation systems, as reported in a New York Times Magazine article [25].
In general, two factors drive the success of deep learning [33]. Firstly, the computational power available for computing the weights of large neural networks has grown exponentially over the years. Secondly, the huge amount of available data, which is the source of training sets, has made the networks learn better than before. Consequently, significant progress in NLP tends to come from institutions that have the facilities to collect massive datasets, label them, and process them for training purposes. However, this costly training may be reduced by employing transfer learning [21]. In short, transfer learning allows pretrained models to be reused for solving similar tasks. Well-known examples of transfer learning are pretrained networks such as Xception [6], Inception [42], ResNet50 [16], VGG16 [39], and MobileNet [37] for image classification.
This research is related to text classification; specifically, we investigate the use of transfer learning for an Indonesian news text classification task and for overcoming the overfitting problem. Experiments are performed to validate whether transfer learning can help text classification algorithms increase their performance and mitigate overfitting. Regarding the choice of transfer learning algorithm, we opt for Universal Language Model Fine-tuning (ULMFiT) because it has been reported as one of the state-of-the-art transfer learning algorithms [21].

Literature Review
Prior work has documented three interconnected components that make up ULMFiT: regularization, weight-dropped LSTM, and transfer learning.

Weight-dropped LSTM
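The long short-term memory (LSTM) network is a popular variant of the recurrent neural network [18,11] that is robust to vanishing or exploding gradients [5]. In its standard form, LSTM is formulated as follows:

$$i^{(t)} = \sigma\left(W_i x^{(t)} + U_i h^{(t-1)}\right) \tag{1}$$
$$f^{(t)} = \sigma\left(W_f x^{(t)} + U_f h^{(t-1)}\right) \tag{2}$$
$$o^{(t)} = \sigma\left(W_o x^{(t)} + U_o h^{(t-1)}\right) \tag{3}$$
$$\tilde{c}^{(t)} = \tanh\left(W_c x^{(t)} + U_c h^{(t-1)}\right) \tag{4}$$
$$c^{(t)} = i^{(t)} \circ \tilde{c}^{(t)} + f^{(t)} \circ c^{(t-1)} \tag{5}$$
$$h^{(t)} = o^{(t)} \circ \tanh\left(c^{(t)}\right) \tag{6}$$

where $x^{(t)}$ is the input at timestep $t$, $h^{(t)}$ is the hidden state at timestep $t$, $c^{(t)}$ is the state of the memory cell, $\sigma$ is the logistic sigmoid, and $\circ$ denotes element-by-element multiplication. Weight-dropped LSTM modifies the standard LSTM by applying a DropConnect [44] operation to the recurrent hidden-to-hidden weight matrices, specifically the weight matrices $\{U_i, U_f, U_o, U_c\}$ in Eq. (1), (2), (3), and (4) respectively. The use of DropConnect prevents the LSTM from overfitting. The weight-dropped LSTM is commonly called AWD-LSTM [28].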
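To make the weight-dropping idea concrete, the following is a minimal NumPy sketch, not the implementation used in this research. It assumes that a single DropConnect mask is sampled per forward pass and reused across all timesteps, and that the surviving weights are rescaled by 1/(1 - p); the function names, the dictionary layout of the weights, and the absence of bias terms are illustrative choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def drop_connect(weight, drop_prob, rng):
    """Zero individual weights with probability drop_prob and rescale the rest.

    The 1 / (1 - drop_prob) rescaling is an assumption (inverted dropout).
    """
    if drop_prob <= 0.0:
        return weight
    keep_mask = rng.random(weight.shape) >= drop_prob
    return weight * keep_mask / (1.0 - drop_prob)


def lstm_step(x, h_prev, c_prev, W, U):
    """One LSTM step; W and U are dicts keyed by gate name 'i', 'f', 'o', 'c'."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev)
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev)
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev)
    c_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev)
    c = i * c_tilde + f * c_prev
    h = o * np.tanh(c)
    return h, c


def weight_dropped_forward(xs, W, U, drop_prob, rng):
    """Run an LSTM over a sequence with DropConnect on the hidden-to-hidden weights."""
    # The mask is sampled once, so the same weights stay dropped at every timestep.
    U_dropped = {gate: drop_connect(U[gate], drop_prob, rng) for gate in U}
    hidden_size = U["i"].shape[0]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, W, U_dropped)
        outputs.append(h)
    return np.stack(outputs)
```

Only the recurrent matrices are masked here; the input-to-hidden matrices are left untouched, which mirrors the description above.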

Regularization
Regularization is a common technique in machine learning for fighting overfitting. Overfitting occurs when a text classification model achieves almost perfect performance on the training set (Fig. 2) but performs much worse on the testing set (Fig. 3). The cost function of a neural network without regularization, Eq. (7), is defined in terms of the model prediction $h_\theta(x^{(i)})$ for the input $x^{(i)}$, the model parameters $\theta$, and the $i$-th label $y^{(i)}$ for the $i$-th feature vector. To mitigate overfitting, a constraint is added to Eq. (7); the resulting cost function with regularization additionally involves a regularization parameter $\lambda$ and the model parameters $\theta_j$ indexed by $j = 1, \ldots, n$.
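In a generic form, the unregularized and regularized cost functions can be sketched as below. The per-example loss $\ell$, the number of training examples $m$, and the $\frac{\lambda}{2m}$ scaling of the L2 penalty are assumptions chosen for illustration; other losses and scalings are equally common.

```latex
% Unregularized cost: average per-example loss (the loss \ell is an assumption)
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell\left(h_\theta(x^{(i)}),\, y^{(i)}\right)

% Regularized cost: the same term plus an L2 penalty on the parameters
J_{\text{reg}}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell\left(h_\theta(x^{(i)}),\, y^{(i)}\right)
    + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}
```

A larger $\lambda$ pushes the parameters toward zero, trading training accuracy for a smaller gap between training and testing performance.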
There are four additional regularization techniques that reduce data storage during model training and prevent overfitting of the LSTM model. Firstly, variable-length backpropagation sequences allow efficient data usage by randomly selecting a sequence length for forward and backward propagation. Secondly, embedding dropout performs dropout on the embedding matrix at the word level; specifically, the dropout is broadcast across all of a word's embedding vector. The last two techniques are related to L2 regularization. Activation regularization keeps the outputs of activation functions from becoming significantly greater than 0, whereas temporal activation regularization prevents the model from producing large changes in the hidden state between timesteps [29].
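The last two penalties can be illustrated with a short NumPy sketch. It assumes that both penalties are computed as mean squared values over a matrix of hidden states with one row per timestep; the coefficient names `alpha` and `beta` are illustrative and not taken from this research.

```python
import numpy as np


def activation_regularization(hidden_states, alpha):
    """AR: penalize activations that are far from zero (mean of squares)."""
    return alpha * np.mean(hidden_states ** 2)


def temporal_activation_regularization(hidden_states, beta):
    """TAR: penalize large changes between consecutive hidden states."""
    diffs = hidden_states[1:] - hidden_states[:-1]
    return beta * np.mean(diffs ** 2)
```

Both terms would be added to the training loss; a sequence of constant hidden states yields a temporal penalty of exactly zero.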

Transfer learning
Transfer learning is a technique for utilizing a pretrained network [7]. There are two ways of employing the pretrained network: feature extraction and fine-tuning. Specifically, Universal Language Model Fine-tuning (ULMFiT) [21] first trains a language model (LM) on a general-domain corpus and then fine-tunes the LM on a target task, for example, text classification. Concretely, Fig. 4 shows the three steps of ULMFiT construction. Firstly, an LM is constructed by learning from a large dataset, WikiText-103. Secondly, because the target task is movie review classification, the LM is fine-tuned on the IMDb corpus. Finally, the LM is trained as a classifier.

Research Methodology
This section starts with the description of our dataset. Next, we elaborate the steps of constructing the ULMFiT model. The target task in this research is to classify news articles released by the Agency for the Assessment and Application of Technology (BPPT). The dataset is available for download at https://github.com/hbunyamin/BPPTdataset. Specifically, Table 1 displays samples of the BPPT dataset.

Preprocessing the dataset
Next, Table 2 displays the number of articles in each label. Besides the BPPT dataset, our second dataset comprises articles collected from several news providers, such as liputan6.com, kompas.com, tribunnews.com, detik.com, beritasatu.com, and tempo.co, during 2017. In total, 455,626 articles were collected. This dataset is used for training the pre-trained language model, which is one of the ingredients of the ULMFiT model. A sample of a preprocessed news article just before tokenization is shown in Table 3. For training and evaluation, our BPPT dataset is divided into a train set and a validation set. Each of the two sets has four folders, one for each of the four categories, as elaborated in Table 4. To facilitate hyperparameter tuning, the setup of the news provider dataset is slightly different from that of the BPPT dataset: it is divided into three sets, namely train, validation, and test sets, as explained in Table 5.
Table 3. A sample preprocessed news article before tokenization (long sentences have been split in order to fit the column width)
Table 4. Detailed number of documents employed in training and testing processes for document classification
Table 5. Number of documents broken down into train, validation, and test sets in order to train ULMFiT

Before the ULMFiT model is constructed, the dataset needs preprocessing. A tokenizer from the fast.ai library, specifically the WordTokenizer class, is employed; it produces several special tokens that mark particular positions, such as xxbos (the beginning of a document), xxmaj (the next word begins with an uppercase letter), and xxeos (the end of a document) [19]. Additionally, dates and numbers are converted into <date> and <number> respectively. Fig. 1 shows a sample of the preprocessing results. After preprocessing, the results are split into tokens separated by whitespace; therefore, each document is represented by a list of tokens.
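The preprocessing described above can be sketched in plain Python as follows. This is a simplified imitation and not the fast.ai tokenizer: the date and number patterns are hypothetical (they only cover forms such as 12/05/2017 and 3,5), and only the xxbos and xxmaj markers are emitted.

```python
import re

# Hypothetical patterns; the formats actually handled in this research are not specified.
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
NUMBER_PATTERN = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def preprocess(text):
    """Replace dates first, then any remaining numbers, with placeholder tokens."""
    text = DATE_PATTERN.sub("<date>", text)
    text = NUMBER_PATTERN.sub("<number>", text)
    return text


def tokenize(text):
    """Split on whitespace and add simplified xxbos / xxmaj markers."""
    tokens = ["xxbos"]
    for tok in preprocess(text).split():
        # A capitalized word becomes the marker xxmaj followed by its lowercase form.
        if tok[0].isupper() and tok[1:] == tok[1:].lower():
            tokens.extend(["xxmaj", tok.lower()])
        else:
            tokens.append(tok)
    return tokens
```

Dates are replaced before numbers so that the digits inside a date are not turned into separate <number> tokens.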

Steps of building the ULMFiT model
After the lists of tokens are obtained, numericalization is carried out. Numericalization is the process of mapping tokens to integers: it constructs a list of all possible tokens (a vocab) and replaces each token with its index in the vocab. Once all documents are represented as lists of integers, each list of integers is divided into 64 batches (batch size = 64). These batches are now ready to be converted into embeddings by the pre-trained language model (LM). Specifically, the pre-trained LM [30] is trained on news articles and is illustrated by the first LM in Fig. 5. Next, the embeddings are fed into a recurrent neural network (RNN) with an AWD-LSTM architecture. The next step is fine-tuning the AWD-LSTM architecture on the BPPT dataset. This step is needed because the distribution of the BPPT dataset may differ from that of the news provider dataset. We first freeze the architecture and train only the embeddings. After training the embeddings, we unfreeze the architecture and train both the embeddings and the architecture. Furthermore, we save the whole model except the final layer; the final layer contains an activation layer for computing the probability of picking each token in the vocab. We call the model without the final layer the encoder [19]. The encoder is represented as the second LM in Fig. 5.
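The numericalization and batching steps can be sketched as follows. The helper names, the xxunk token for out-of-vocabulary words, and the choice to split one long sequence of ids into equal-length contiguous streams are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(token_lists, max_size=None, unk="xxunk"):
    """Build index-to-token and token-to-index maps, most frequent tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [unk] + [tok for tok, _ in counts.most_common(max_size) if tok != unk]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, unk="xxunk"):
    """Replace each token with its vocab index; unknown tokens map to unk."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]


def split_into_streams(ids, batch_size=64):
    """Split a list of ids into batch_size equal-length contiguous streams.

    Leftover ids that do not fill a whole stream are dropped.
    """
    stream_len = len(ids) // batch_size
    return [ids[b * stream_len:(b + 1) * stream_len] for b in range(batch_size)]
```

For example, with a vocab built from `[["a", "b", "a"], ["c", "a", "b"]]`, the token list `["a", "z", "c"]` maps to `[1, 0, 3]`, where 0 is the index of the unknown token.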
The final step is fine-tuning a text classifier on the BPPT dataset again; this time, however, we let the classifier learn the document categories of the dataset. Specifically, we load the encoder and train the classifier with discriminative fine-tuning and gradual unfreezing [21]. We unfreeze the layers a few at a time rather than all at once, which is the common practice in computer vision, because gradual unfreezing makes a significant difference in NLP. This completes our ULMFiT model; the classifier in Fig. 5 illustrates the text classifier.
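Gradual unfreezing can be pictured as a sequence of training stages in which progressively more trailing layer groups become trainable. The sketch below only computes the trainable flags per stage; the number of layer groups and the stage list are hypothetical inputs, and no actual training is performed.

```python
def trainable_flags(num_groups, num_unfrozen):
    """Return one flag per layer group; only the last num_unfrozen groups are trainable."""
    if not 0 <= num_unfrozen <= num_groups:
        raise ValueError("num_unfrozen must be between 0 and num_groups")
    return [idx >= num_groups - num_unfrozen for idx in range(num_groups)]


def gradual_unfreezing(num_groups, stages):
    """Return the trainable flags for each stage, e.g. stages=[2, 3, 4]."""
    return [trainable_flags(num_groups, n) for n in stages]
```

With four hypothetical layer groups, `gradual_unfreezing(4, [2, 3, 4])` first trains the last two groups, then the last three, and finally the whole model.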
The hyperparameters for the AWD-LSTM model in our experiments follow the recommendations of the original paper [21] and are listed in Table 6.

Baseline models
Baseline models are needed to examine the effect of ULMFiT on reducing overfitting. The baseline models constructed for comparison are neural networks with two hidden layers; specifically, each hidden layer has eight nodes. This number is chosen because a larger number makes the networks overfit the dataset. Moreover, each node in the hidden layers uses the ReLU activation function, which is chosen because it mitigates the vanishing gradient problem [34]. For the same reason of limiting overfitting, we restrict the inputs to the 10,000 most frequent words; the outputs are the four categories, namely economics, international, science, and sports. The number of instances in the validation set is 1,000, and the rest is considered as the test set. Fig. 6 shows the training and validation accuracies of the baseline model over 20 epochs. The training accuracy reaches an almost perfect score, 99%, whereas the validation accuracy is 95%. The difference between the train and validation sets is 4%, which is quite large and signifies overfitting in the model. We shall see how the ULMFiT model overcomes this major problem.

Table 7 shows the training process of constructing the pre-trained LM, based on the pointer sentinel mixture architecture [30], on the news articles dataset. This architecture acts as the pre-trained LM (the first LM) described in Fig. 5. The learning runs smoothly, as the training and validation losses keep decreasing. Each epoch takes approximately 2.5 hours.
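For reference, the baseline architecture described earlier in this section (10,000 inputs, two hidden layers of eight ReLU nodes, and four output categories) can be sketched as a NumPy forward pass. The multi-hot input encoding, the softmax output layer, and the weight initialization are assumptions, since they are not specified in the description of the baseline.

```python
import numpy as np


def init_baseline(rng, n_inputs=10_000, n_hidden=8, n_classes=4):
    """Create (weight, bias) pairs for two hidden layers and one output layer."""
    sizes = [n_inputs, n_hidden, n_hidden, n_classes]
    return [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def multi_hot(token_ids, n_inputs=10_000):
    """Encode a document as a 0/1 vector over the most frequent words (assumed encoding)."""
    x = np.zeros(n_inputs)
    x[list(token_ids)] = 1.0
    return x


def forward(params, x):
    """ReLU hidden layers followed by a softmax over the four categories."""
    a = x
    for W, b in params[:-1]:
        a = np.maximum(0.0, a @ W + b)
    W, b = params[-1]
    logits = a @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)
```

Under these assumptions the network has 80,116 trainable parameters, almost all of them in the first layer, which is one reason the input vocabulary size matters for overfitting.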

Validation performance of ULMFiT
Since the vocabularies of the news provider dataset and the BPPT dataset are different, there might be several words in the BPPT dataset that are not in the vocabulary of the pre-trained model. Consequently, the embeddings of words in the pre-trained model are merged with random embeddings for words that are not in the pre-trained model. To merge both sets of embeddings, we fine-tune the model by freezing the pre-trained model but not the embeddings. Table 8 shows the training and validation loss of this fine-tuning.
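The merging of pre-trained and random embeddings can be sketched as follows. The function name, the normal distribution, and the scale of the random initialization are assumptions for illustration.

```python
import numpy as np


def merge_embeddings(pretrained_itos, pretrained_emb, target_itos, rng, scale=0.1):
    """Build an embedding matrix for the target vocab.

    Rows for words known to the pre-trained model are copied; rows for
    unseen words keep their random initialization.
    """
    pretrained_stoi = {tok: idx for idx, tok in enumerate(pretrained_itos)}
    emb_dim = pretrained_emb.shape[1]
    merged = rng.normal(0.0, scale, size=(len(target_itos), emb_dim))
    for row, tok in enumerate(target_itos):
        src = pretrained_stoi.get(tok)
        if src is not None:
            merged[row] = pretrained_emb[src]
    return merged
```

The randomly initialized rows are the ones that benefit most from the subsequent fine-tuning with the rest of the model frozen.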
Next, the merged embeddings are fed into the state-of-the-art language model AWD-LSTM [28] and trained with discriminative learning rates whose initial values equal $2 \times 10^{-3}$. Discriminative learning rates are used because different layers capture different kinds of information [46,21]; therefore, each layer has its own learning rate. Combined with Stochastic Gradient Descent (SGD), the update with discriminative learning rates is stated as follows:

$$\theta^{[l]}_{t} = \theta^{[l]}_{t-1} - \alpha^{[l]} \, \nabla_{\theta^{[l]}} J(\theta)$$

where $t$ is a time step, $[l]$ denotes the $l$-th layer, $\theta^{[l]}$ are the parameters of the $l$-th layer, $J(\theta)$ is the cost function, and $\alpha^{[l]}$ is the learning rate for the $l$-th layer. The results of training with discriminative learning rates are shown in Table 9. After this training, we have an updated LM, shown as the second LM in Fig. 5.
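A per-layer SGD update with discriminative learning rates can be sketched as below. The rule of dividing the learning rate by 2.6 for each lower layer follows the suggestion in the ULMFiT paper [21]; whether the same factor was used in these experiments is an assumption, as are the function names.

```python
import numpy as np


def discriminative_learning_rates(top_lr, num_layers, factor=2.6):
    """Give the last layer top_lr and divide by factor for each layer below it."""
    return [top_lr / factor ** (num_layers - 1 - layer) for layer in range(num_layers)]


def sgd_step(params, grads, lrs):
    """Apply one SGD update per layer, each with its own learning rate."""
    return [p - lr * g for p, g, lr in zip(params, grads, lrs)]
```

For instance, `discriminative_learning_rates(2e-3, 3)` assigns the full rate to the last layer and progressively smaller rates to the earlier layers, so the general-purpose lower layers change more slowly than the task-specific upper layers.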
After training the AWD-LSTM model, we save the LM except for its final layer; the saved part is called the encoder. Next, this encoder is loaded before fine-tuning a classifier, which is the final stage of ULMFiT construction. The fine-tuning process adds two linear blocks to the model. Each block employs batch normalization [22] and dropout, with ReLU activation functions for the intermediate layers and a softmax activation in the last layer that produces the probability distribution over the classes. Critically, the input to the final layers consists of the output of the last hidden state, the average of the outputs of all hidden states, and the maximum of the outputs of all hidden states. These three inputs are merged by a technique called concat pooling. We train the model with discriminative learning rates and gradual unfreezing. Firstly, we unfreeze the last two parameter groups; secondly, we unfreeze the last three parameter groups; lastly, we unfreeze the whole model. This NLP best practice is slightly different from the one in computer vision, which unfreezes the complete model at once, and it has been shown to yield significant improvements [19]. Furthermore, we choose a small number of epochs, conforming to the original paper's setting [19], because of the so-called super-convergence phenomenon [40]. Tables 10, 11, and 12 display the three gradual unfreezing processes respectively.
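Concat pooling, as described above, can be sketched in a few lines of NumPy. The function assumes a matrix of hidden states with one row per timestep; the ordering of the three parts follows the description in the text and is otherwise arbitrary.

```python
import numpy as np


def concat_pooling(hidden_states):
    """Concatenate the last hidden state with the mean and max over all timesteps.

    hidden_states has shape (timesteps, hidden_size); the result has
    shape (3 * hidden_size,).
    """
    last = hidden_states[-1]
    mean_pool = hidden_states.mean(axis=0)
    max_pool = hidden_states.max(axis=0)
    return np.concatenate([last, mean_pool, max_pool])
```

Pooling over all timesteps lets the classifier use information from anywhere in a long document instead of relying on the final hidden state alone.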
Specifically, Table 12 shows the prediction accuracy of ULMFiT. With ULMFiT, the difference between the train and validation accuracies approaches zero ($2.5 \times 10^{-3}$). This result suggests that ULMFiT is a strong choice of model for overcoming overfitting.

Table 9. The training and validation loss during training AWD-LSTM with discriminative learning rates
Table 12. Results of training with gradual unfreezing the last two parameter groups
Table 11. Results of training with gradual unfreezing the last three parameter groups

Conclusion
This research explores the use of transfer learning to leverage a language model for a text classification task. In particular, we implement Universal Language Model Fine-tuning (ULMFiT) for the Indonesian language, as this technique has the capability to overcome the overfitting problem. To the best of our knowledge, this is the first ULMFiT model fine-tuned on Indonesian news articles. Although the prediction accuracy on the validation set is lower than that of the baseline method, ULMFiT successfully reduces overfitting: the difference between training and validation accuracies falls from 4% to nearly zero.