Investigation of Pre-Trained Bidirectional Encoder Representations from Transformers Checkpoints for Indonesian Abstractive Text Summarization

Text summarization aims to shorten a text by removing less useful information so that the key information can be obtained quickly and precisely. In Indonesian abstractive text summarization, research has mostly focused on multi-document summarization, whose methods do not work optimally in single-document summarization. As the public summarization datasets and works in English focus on single-document summarization, this study emphasized Indonesian single-document summarization. Abstractive text summarization studies in English frequently use Bidirectional Encoder Representations from Transformers (BERT), and since Indonesian BERT checkpoints are available, they were employed in this study. This study investigated the use of Indonesian BERT in abstractive text summarization on the IndoSum dataset using the BERTSum model. The investigation proceeded by using various combinations of model encoders, model embedding sizes, and model decoders. Evaluation results showed that models with a larger embedding size and a Generative Pre-Training (GPT)-like decoder could improve the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score and BERTScore of the model results.


INTRODUCTION
Text summarization is one of the solutions used to obtain information quickly and accurately, because it allows information to be gained without losing the meaning of the actual document (Widyassari et al., 2019). In its application in technology, text summarization can facilitate several aspects of work on search engines, digital business, and journalistic media (Adelia et al., 2019). In general, there are two approaches to text summarization: extractive and abstractive. In the extractive approach, the system generates summaries by selecting important information in the form of sentences or phrases from the source text, which is similar to a classification problem. In contrast, the abstractive approach generates summaries by paraphrasing and generating new sentences or phrases while keeping the information related to the source text. Extractive approaches are easier to implement and have more straightforward methods; therefore, research in that area is more developed than research in abstractive approaches. However, the abstractive approach is ideal for summarizing text as it follows how humans generate summaries (Devianti & Khodra, 2019; Nallapati et al., 2016b). The most used models for abstractive text summarization are sequence-to-sequence models, which consist of an encoder and a decoder, as they give strong results. Several works have used this model (Nallapati et al., 2016a; Nallapati et al., 2016b; See et al., 2017; Shi et al., 2021; Zhou et al., 2017), starting from Rush et al. (2015), who successfully applied the model in machine translation tasks.
Moreover, the emergence of the transformer model (Vaswani et al., 2017), a breakthrough in Natural Language Processing (NLP), was accompanied by other breakthrough language models that use contextual representation pre-training, such as Embeddings from Language Models (ELMo) (Peters et al., 2018), Generative Pre-trained Transformer-2 (GPT-2) (Radford et al., 2019), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), and Bidirectional and Auto-Regressive Transformer (BART) (Lewis et al., 2019). From that point, research on abstractive text summarization has begun to shift toward using these models as references because they are considered best practice. The most frequently encountered studies use BERT as the foundation to build their models (Liu & Lapata, 2020; Rothe et al., 2020; Savelieva et al., 2020; Zhang, Kishore, Wu, et al., 2019). BERT's success has influenced other researchers to produce their own BERT versions in other languages, such as the Chinese BERT (Cui et al., 2019), French BERT (Martin et al., 2019), German BERT (Rönnqvist et al., 2019), and Indonesian BERT (Wilie et al., 2020).
At the time this paper was written, there were two well-known large-scale Indonesian BERT checkpoints with the same name, IndoBERT (Koto, Rahimi, Lau, et al., 2020; Wilie et al., 2020), which are used for benchmarking on several Indonesian NLP and Natural Language Understanding (NLU) tasks. Wilie et al. (2020) leveraged their pre-trained IndoBERT model checkpoints for single-sentence classification, sentence-pair classification, single-sentence sequence labeling, and sentence-pair sequence labeling tasks on 12 datasets. Meanwhile, Koto, Rahimi, Lau, et al. (2020) leveraged their model checkpoint for sequence labeling, semantic, and coherency tasks on nine datasets, including IndoSum in an extractive manner. Neither paper provides a benchmark for abstractive text summarization tasks.
However, Indonesian abstractive text summarization has recently been gaining attention because the newly released large-scale dataset named Liputan6 (Koto, Lau & Baldwin, 2020) has highly abstractive gold summaries. There is also another summarization dataset, IndoSum (Kurniawan & Louvan, 2018). Both datasets consist of news document-summary pairs and have the potential to become benchmark datasets in Indonesian text summarization, as the Gigaword corpus (Rush et al., 2015), Newsroom (Grusky et al., 2018), XSum (Narayan et al., 2018), and CNN/Daily Mail (CNNDM) (Hermann et al., 2015) are in English. However, the models and methods used for Indonesian abstractive text summarization are considered obsolete compared to the English text summarization models. The methods that have been employed include Sentence Fusion (Christie & Khodra, 2016), Abstractive Meaning Representation (Severina & Khodra, 2019), Genetic Semantic Graph (Devianti & Khodra, 2019), and Bidirectional Gated Recurrent Unit (BiGRU) in sequence-to-sequence models (Adelia et al., 2019). These methods are outdated, as research in English uses pre-trained language models for this task. There are also problems regarding the evaluation results, as there are hardly any standards for the datasets and evaluation methods used in Indonesian text summarization. Since research in English frequently uses BERT in abstractive text summarization models, this paper investigates and leverages two IndoBERT checkpoints (Koto, Rahimi, Lau, et al., 2020; Wilie et al., 2020) for the task.
This paper aims to investigate two IndoBERT checkpoints for abstractive text summarization tasks using the state-of-the-art model utilizing BERT, following Liu and Lapata (2020). The investigation proceeds by using various combinations of model encoders, model embedding sizes, and model decoders based on the findings while investigating IndoBERT checkpoints. The result of this study is reported with the IndoSum dataset on Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) and BERTScore (Zhang, Kishore, Wu, et al., 2019) metrics.

RELATED WORKS
This section reviews the related research works to contextualize the present work. This section is divided into two parts: a review of research on abstractive text summarization in English for the general information of abstractive text summarization, and research on abstractive text summarization in Indonesian to identify the current development in Indonesian research.

English Abstractive Text Summarization
In recent work on English abstractive text summarization, the most used models are transformer-based and sequence-to-sequence models. Hoang et al. (2019) used a pre-trained Generative Pre-Training (GPT) model as a starting point for abstractive summarization. Their research proposed source embedding and domain-adaptive training that could facilitate the use of the GPT model as a text summarizer. Even though the model used parameters from the pre-trained GPT, there were differences in the type of language between the pre-training dataset and the article summary dataset, which were fictional stories and news, respectively. With domain-adaptive training, the model was trained to produce a type of language similar to the training dataset. Next, the model was trained on three datasets, Newsroom (Grusky et al., 2018), XSum (Narayan et al., 2018), and CNNDM (Hermann et al., 2015), to produce a summary of an article. The model scored a significant increase in ROUGE-L on two datasets, Newsroom and XSum. At the same time, the other model achieved higher scores in human evaluation on non-redundancy, coherence, and focus.
There are also some works that incorporate BERT. One of them utilized BERT in a sequence-to-sequence model with a standard transformer decoder (Zhang, Kishore, Wu, et al., 2019). However, BERT was pre-trained while the decoder was trained from scratch. There was concern that, in this situation, the decoder would not be able to use the context from BERT optimally; therefore, a two-stage decoding process was created to make maximum use of BERT's capabilities. On the CNNDM dataset, this study succeeded in improving ROUGE performance compared to previous studies. In another work (Liu & Lapata, 2020), BERT was also used in a sequence-to-sequence manner. This research proposed a new training method in which the encoder and decoder had different optimizers. The encoder was configured to learn more slowly because it had gone through pre-training, while the decoder learned faster to keep up with the encoder. In addition, a two-stage training was carried out: in the first stage, the encoder was trained on extractive summarization, and in the second stage, the model was trained on abstractive summarization. They produced excellent scores in both extractive and abstractive summarization with a model that had a minimal number of parameters. Afterward, the model from Liu and Lapata (2020) was used by Savelieva et al. (2020) to produce abstractive summaries of written instructions.

Indonesian Abstractive Text Summarization
There are numerous works on extractive text summarization in Indonesian (Christian et al., 2016; Garmastewira & Khodra, 2019; Halim et al., 2020; Hidayat et al., 2015; Najibullah, 2015); however, the abstractive side has not been investigated as far. Although there are already dedicated datasets for text summarization (Kurniawan & Louvan, 2018), these datasets are not widely used. One of the initial studies on abstractive summarization (Christie & Khodra, 2016) summarized multiple documents using the Sentence Fusion method. Sentence Fusion is a method for generating a sentence from a collection of similar sentences and has been called a semi-extractive method. Implementing this method did not require machine learning; it was closer to a clustering method with light pre-processing in the form of Part-of-Speech (POS) tagging and stopword removal. The dataset consisted of Indonesian news articles from previous research (Ilyas, 2015) with additional data collected by the researchers themselves. They used the ROUGE metrics in their research; however, in evaluating the clustering method, they did not report the ROUGE scores. Oddly, they did not use ROUGE for evaluating the produced summaries. Instead, human evaluation was used for grammaticality and informativeness. Another study (Devianti & Khodra, 2019) adapted the Genetic Semantic Graph method by using extraction of Subject, Verb, Object, and Adverbial (SVOA) from sentences plus some rules, cosine similarity based on word embeddings to calculate word similarities, and heuristic rules for Natural Language Generation (NLG). The dataset consisted of news articles taken from previous research (Christie & Khodra, 2016; Garmastewira & Khodra, 2019). ROUGE-2 recall was used for evaluating the summaries.
The Abstractive Meaning Representation (AMR) method was used by Severina and Khodra (2019) to summarize the text of many documents in an abstractive way. The existing AMR graph was a highly specific tree structure for English because it was based on grammar rules. This study tried to make an AMR graph in Indonesian and used it in summarizing text. Before being made into the AMR graph, the existing documents went through Agglomerative Hierarchical Clustering to select sentences that represented multiple documents. After the AMR graph was created, the graph was re-selected by using Integer Linear Programming (ILP) and supervised learning via the perceptron. With the dataset that was self-gathered by the researchers, this study used ROUGE recall for evaluating the summaries.
The works mentioned above are multi-document abstractive summarization systems, which depend on clustering (Christie & Khodra, 2016) and graphs (Devianti & Khodra, 2019; Severina & Khodra, 2019) to pool the documents in the dataset. Such systems will not work optimally in single-document abstractive text summarization because of the difference in the number of texts. In addition, the methods make the systems highly dependent on the limited Indonesian resources available in the summarization dataset. Meanwhile, modern works in English (Hoang et al., 2019; Liu & Lapata, 2020; Savelieva et al., 2020; Zhang, Cai, Xu, et al., 2019) used transfer learning with models pre-trained on other datasets to achieve better results in single-document abstractive text summarization.
For single-document abstractive text summarization, there is a work that utilized the sequence-to-sequence model (Adelia et al., 2019). This work used BiGRU as the encoder and a Gated Recurrent Unit (GRU) with an attention mechanism as the decoder. The dataset, self-gathered by the researchers, consisted of Indonesian journal documents with their abstracts as the summary targets. The summary results contained repeated words, and the cohesion of the sentences was still not optimal: the language elements used to construct the summary lacked a relationship with one another.
It can be concluded that the Indonesian abstractive text summarization methods used in available research are still not optimal for single-document abstractive text summarization. Furthermore, there is another problem with the datasets and evaluation metrics employed. Each study used different datasets and evaluation metrics, which made the methods difficult to compare. As there is a large gap between the research progress in English and Indonesian abstractive text summarization, this paper's objective is to close this gap. This paper addresses two problems found in Indonesian research. First, to make the results easy to compare with other papers, the public Indonesian dataset IndoSum is used for training and testing the model. This paper also employs ROUGE (Lin, 2004) and BERTScore (Zhang, Kishore, Wu, et al., 2019) as evaluation metrics, following Koto, Lau and Baldwin (2020). Second, to approach the results obtained in English research, BERT is used, as there are currently two Indonesian BERT checkpoints with no benchmark on abstractive text summarization tasks. The experiments in this paper investigate their use in building abstractive text summarization models.

METHODOLOGY
This section explains the methodology used in this paper to reach the research objectives, which starts from literature review, data collection, pre-processing, checkpoints exploration, modeling and fine-tuning, and evaluation as shown in Figure 1. A literature review was conducted to identify research problems in abstractive text summarization, mainly Indonesian. The next step was to collect a dataset that would be used in the training and model evaluation. Then, a pre-processing of the dataset was performed. After conducting an exploration on the IndoBERT checkpoints, the designing of the model using BERT was carried out. The model that was designed would then be fine-tuned and then evaluated with the ROUGE and BERTScore metrics.

Figure 1
The IndoBERT Checkpoints Investigation Method

Models and Exploration on IndoBERT Checkpoints
The model used in this paper for abstractive text summarization followed the model by Liu and Lapata (2020). Their model utilized a pre-trained BERT checkpoint as the encoder and a standard transformer as the decoder. There were several variants of the model, namely extractive summarization (BERTSumExt), abstractive summarization (BERTSumAbs), and hybrid summarization that utilized both extractive and abstractive methods (BERTSumExtAbs). This paper used the abstractive model, BERTSumAbs, for the experiments. On the encoder side, two Indonesian BERT checkpoints were applied. The first was IndoBERT (indobert-base-p2) from Wilie et al. (2020), which was trained in two phases for 1M and 68k steps. In the first phase, it was pre-trained with 128 tokens, while in the second phase, it was pre-trained with 512 tokens. The model was pre-trained on the Indo4B dataset, consisting of 3.6B words from various sources, which can be seen as a general dataset. The second was IndoBERT (indobert-base-uncased) from Koto, Rahimi, Lau, et al. (2020), which was trained for 2.4M steps on a dataset comprising 220M words from three main corpora: Indonesian Wikipedia, news articles, and the Indonesian Web Corpus. To avoid confusion, as both have identical names, from this point they are called IndoBERT-NLU (indobert-base-p2) and IndoBERT-LEM (indobert-base-uncased), following their paper titles.
As both checkpoints came from benchmark papers, the papers used the IndoBERT checkpoints for benchmarking on several tasks. For IndoBERT-NLU, there were 12 tasks divided into four categories: single-sentence classification, single-sentence sequence-tagging, sentence-pair classification, and sentence-pair sequence labeling. For IndoBERT-LEM, there were seven tasks divided into three categories: morpho-syntax and sequence labeling, semantic, and discourse coherence. There was a summarization task in the semantic category; however, only the extractive model was benchmarked. Neither paper reported an abstractive summarization benchmark with IndoBERT.
Both checkpoints shared the same model dimensions: 12 layers, a hidden size of 768, a filter size of 3,072, and 12 attention heads. Nevertheless, the vocabulary (vocab) size and embedding layers were different. IndoBERT-NLU claimed a vocab size of 30,522; however, the actual checkpoint was found to have a vocab size of 30,521. In contrast to the vocab size, the embedding size in the model was set to 50,000. Meanwhile, IndoBERT-LEM had a vocab and embedding size of 31,923.
As for the decoder, this paper used six layers of standard transformer decoder with a hidden size of 768, a filter size of 2,048, and 8 attention heads (the architecture can be seen in Figure 2). Note that this decoder was not pre-trained. The embedding and vocab size of the decoder followed each of the IndoBERT checkpoints.
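The difference between vocab size and embedding size can be made concrete with a small sketch. The helper `resize_embedding` below is hypothetical (it is not taken from either checkpoint's code) and uses plain Python lists with toy dimensions; it assumes that a larger embedding table keeps the existing rows and appends freshly initialised ones, which is only one plausible way to obtain an embedding table larger than the vocabulary.

```python
import random

def resize_embedding(table, new_rows, seed=0):
    """Return a copy of `table` with exactly `new_rows` rows.

    Existing rows are kept in order; extra rows (if any) are filled with
    small random values, mimicking freshly initialised embeddings.
    """
    rng = random.Random(seed)
    dim = len(table[0])
    kept = [row[:] for row in table[:new_rows]]
    extra = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(new_rows - len(kept))]
    return kept + extra

# Toy stand-in: 5 "vocab" rows of width 4 instead of 30,521 rows of width 768.
pretrained = [[float(i)] * 4 for i in range(5)]

matched = resize_embedding(pretrained, 5)  # embedding size equals vocab size
padded = resize_embedding(pretrained, 8)   # embedding size larger than vocab size
```

In this sketch, token ids only ever index the first five rows, so the three appended rows of `padded` are never looked up by the tokenizer; they only add parameters to the table.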

Figure 2
Architecture Comparison of Standard Transformer Decoder (left) and GPT-Like Decoder (right).
The models were fine-tuned on the IndoSum dataset for 20,000 steps in total (~44 epochs). The encoder had already been pre-trained, while the decoder was initialized randomly. Fine-tuning might therefore be unstable, as the encoder might overfit while the decoder underfits, or vice versa. To make the fine-tuning more stable, two Adam optimizers, one for the encoder and one for the decoder, were used with different learning rates and warm-up steps, following the schedule of Liu and Lapata (2020) presented in Equation 1.
$$lr_x = \tilde{lr}_x \cdot \min\left(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup}_x^{-1.5}\right) \quad (1)$$
where x denotes either encoder e or decoder d, and $\tilde{lr}_x$ is the base learning rate of the corresponding optimizer. The encoder used warmup_e = 8,000, while the decoder used warmup_d = 4,000. This learning schedule makes the pre-trained encoder learn more slowly and the decoder learn faster while keeping the fine-tuning stable over the 20,000 steps.
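The effect of the two warm-up settings can be checked numerically. The sketch below assumes the schedule has the form used by Liu and Lapata (2020), lr_x = base_lr_x * min(step^-0.5, step * warmup_x^-1.5); the function name `warmup_lr` and the base learning rates are placeholders chosen for illustration and are not values reported in this paper.

```python
def warmup_lr(step, base_lr, warmup):
    """Learning rate at `step` (1-indexed) for one optimizer.

    Rises linearly until `step == warmup`, then decays as step ** -0.5.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

# Placeholder base learning rates (assumption, identical for both sides so
# that only the effect of the warm-up length is visible).
BASE_LR_ENCODER = 1.0
BASE_LR_DECODER = 1.0

for step in (1, 4000, 8000, 20000):
    enc = warmup_lr(step, BASE_LR_ENCODER, warmup=8000)
    dec = warmup_lr(step, BASE_LR_DECODER, warmup=4000)
    print(f"step={step:>6}  encoder={enc:.6f}  decoder={dec:.6f}")
```

With equal base rates, the decoder reaches its peak at step 4,000 while the encoder is still warming up, which matches the intent of letting the randomly initialized decoder move faster early in fine-tuning.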

Investigated Model Variants
The main model in this paper consisted of two BERTSumAbs models with different encoders, IndoBERT-NLU and IndoBERT-LEM. This section describes three other variations of the model.
IndoBERT-NLU-30kEmb: Earlier, it was mentioned that IndoBERT-NLU had different embedding and vocab sizes in its configuration. To give IndoBERT-NLU matching sizes, as in IndoBERT-LEM, another BERTSumAbs model with IndoBERT-NLU was fine-tuned with an embedding size equal to its vocab size of 30,521.
IndoBERT-LEM-50kEmb: Further investigation studied whether increasing the embedding size in IndoBERT-LEM to 50,000, as in IndoBERT-NLU, could increase the evaluation scores. Another BERTSumAbs model with IndoBERT-LEM and an embedding size of 50,000 was fine-tuned.
IndoBERT-LEM-GPT: BERT is a stack of transformer encoders, and GPT-2 is a stack of transformer decoders. GPT-2 is also known for its capability to be trained at scale, in terms of both data and parameters. Some tinkering was therefore made to the decoder architecture, in which the placement of the layer normalization (Ba et al., 2016) follows the GPT-like layout shown in Figure 2 (right). In the decoder formulation, the initial input is the output embedding, and a temporary value holds the intermediate result within each layer. LN is the layer normalization, MHAtt is the multi-headed attention, SelfMHAtt gets its input from y, CrossMHAtt gets its input from SelfMHAtt (value) and the encoder output (query and key), and the superscript l indicates the layer index.
It is interesting to observe whether a GPT-like architecture in the decoder could increase the evaluation scores.
The three model variants were fine-tuned with the same hyperparameters as the main models for 5,000 steps (~11 epochs). This paper also shows the results of the main models' checkpoints at 5,000 steps for comparison. Table 1 shows the combination of embedding size, encoder, and decoder for all the models mentioned, including the main models and variant models.
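As a rough illustration of what changing the placement of layer normalization means, the sketch below contrasts a post-LN residual sub-block (the standard transformer layout) with a pre-LN one (the layout used in GPT-2). It is not the decoder used in this paper: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward block, the layer normalization has no learned scale or shift, and the vectors are toy values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Standard transformer layout: LN(x + sublayer(x))."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

def pre_ln_block(x, sublayer):
    """GPT-2 style layout: x + sublayer(LN(x))."""
    out = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, out)]

def toy_sublayer(x):
    """Hypothetical stand-in for attention / feed-forward: scales the input."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0, 9.0]
post = post_ln_block(x, toy_sublayer)  # output is re-normalized
pre = pre_ln_block(x, toy_sublayer)    # residual path keeps x unnormalized
```

In the pre-LN layout the residual path carries the input through untouched, so the output keeps the mean of `x`, whereas the post-LN output is always re-centred around zero.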

RESULTS AND DISCUSSION
The models were evaluated using the IndoSum dataset (Kurniawan & Louvan, 2018). Another summarization dataset, Liputan6 (Koto, Lau & Baldwin, 2020), was more abstractive and much more extensive than IndoSum. However, IndoBERT-LEM used that data in pre-training. There might therefore be bias if the dataset were used with the IndoBERT-LEM checkpoint, and to compare the checkpoints fairly, this paper only employed the IndoSum dataset as a benchmark. IndoSum consists of 19k document-summary pairs and is provided with 5-fold cross-validation splits to make the results more general, as it is a low-resource dataset. However, only the first fold of the dataset was used to make benchmarking easier for future work. The gold summaries in IndoSum appeared to have a high degree of extraction, meaning that they copied sentences from the source articles most of the time.
The text was lowercased, and the input documents and gold summaries were truncated to 512 tokens and 128 tokens, respectively, during the fine-tuning. The findings are reported as ROUGE F1 scores (Lin, 2004), particularly R-1 (unigram overlap) and R-2 (bigram overlap) for informativeness and R-L (longest common subsequence) for fluency, as well as BERTScore (Zhang, Kishore, Wu, et al., 2019), following Koto, Lau and Baldwin (2020). BERTScore computes similarity based on BERT's contextual embeddings, which can capture more similarities between the gold summaries and system summaries. This paper used the ninth layer of the cased version of multilingual BERT to compute BERTScore.
Table 2 shows the test F1 scores of R-1, R-2, R-L, and BERTScore (BS) of all models described in the previous section. To the best of the authors' knowledge, there was no other abstractive summarization research using IndoBERT checkpoints (Koto, Rahimi, Lau, et al., 2020; Wilie et al., 2020) with the IndoSum dataset. Therefore, this paper only shows the scores of baseline and extractive models from previous studies. Nevertheless, Koto, Lau and Baldwin (2020) used IndoBERT-LEM in an abstractive summarization task to evaluate their dataset, Liputan6, using the same model as the present study, the BERTSumAbs model. In addition, a BERTSumAbs model with a randomly initialized encoder and decoder was trained in this paper; however, it generated a sentence of random words for all articles in the test set and was thus not included in the table.
In general, all the models still underperformed the Oracle baseline. Nevertheless, most of the models outperformed the Lead-3 baseline by a large margin. Koto, Rahimi, Lau, et al. (2020) used IndoBERT-LEM for the extractive summarization task with the BERTSumExt model and compared it with other BERT checkpoints, such as Multilingual BERT (MBERT) (Devlin et al., 2019) and the monolingual Malaysian BERT, MalayBERT. In their experiments, the model built with IndoBERT-LEM achieved higher ROUGE scores than the rest.
Compared to the BERTSumExt model with IndoBERT-LEM, the scores of the proposed abstractive models still lagged behind. This was expected, as the IndoSum dataset contains mostly extractive summaries, so extractive models should work better with the dataset.
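To make the three ROUGE variants concrete, the following is a simplified sketch of R-1, R-2, and R-L F1 on whitespace-tokenized, lowercased text. It is an approximation written for illustration only: the function names are hypothetical, the example sentences are made up, and the official ROUGE toolkit applies additional processing (and a summary-level variant of R-L) that is not reproduced here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall from raw counts."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    p, r = overlap / cand_total, overlap / ref_total
    return 2 * p * r / (p + r)

def rouge_n(candidate, reference, n):
    """ROUGE-N F1 using clipped n-gram matches."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    """Sentence-level ROUGE-L F1 based on the LCS length."""
    c, r = candidate.split(), reference.split()
    return f1(lcs_len(c, r), len(c), len(r))

# Made-up toy sentences, already lowercased as in the fine-tuning setup.
reference = "hatta studied in the netherlands"
candidate = "hatta studied in holland"
scores = (rouge_n(candidate, reference, 1),
          rouge_n(candidate, reference, 2),
          rouge_l(candidate, reference))
```

For this toy pair, three of the four candidate unigrams and two of its three bigrams appear in the reference, which is why R-2 is lower than R-1.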

Table 2
Results for the IndoSum First Fold Test Set.
R-1, R-2, R-L are ROUGE metrics. BS is BERTScore computed using bert-base-multilingual (layer 9) as suggested in Zhang, Kishore, Wu, et al. (2019). Note that models with * were computed using 5-fold validation of the IndoSum dataset. The bolded scores are the highest in main models and variant models.
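The matching step behind BERTScore can be sketched with toy vectors. The code below assumes that each token has already been mapped to a contextual embedding (in this paper, these would come from layer 9 of multilingual BERT); the two-dimensional vectors and the function names are illustrative assumptions, and refinements such as IDF weighting or baseline rescaling are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1: each token is matched to its most similar
    token on the other side, and the similarities are averaged."""
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy 2-d "embeddings"; real ones would come from a BERT layer.
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0], [1.0, 1.0]]
score = greedy_match_f1(candidate_vecs, reference_vecs)
```

Unlike n-gram overlap, a candidate token that is merely close to a reference token in embedding space still contributes a partial match, which is why the metric can reward paraphrases.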
Next, the two main models, which used IndoBERT-LEM and IndoBERT-NLU as their encoders, were compared, as presented in Figure 3. IndoBERT-LEM outperformed IndoBERT-NLU in all scores. Furthermore, the R-L of the model with IndoBERT-LEM only improved by +0.57 points from 5k steps to 20k steps. Meanwhile, the model with IndoBERT-NLU improved by +2.07 points, more than that of IndoBERT-LEM, indicating that IndoBERT-NLU needed more steps to converge.

Figure 3
Comparison of the Main Models at 5k Steps and 20k Steps.

Table 3 shows the generated summaries from the main models. An article with a highly abstractive reference summary was chosen. The generated summaries were identical to the reference summary at the beginning and began to differ from the middle to the end of the paragraph. However, the generated summaries mostly still followed the facts from the article. The model with IndoBERT-LEM generated "during those three and a half hours", which was taken from the article even though it was not supposed to be there. However, it succeeded in referring "Indonesia" to "Its soul". Meanwhile, the model with IndoBERT-NLU generated "spin wind", which was different from but still had the same meaning as "windmills" in Indonesian. However, it produced more incorrect and repeated words and unneeded random symbols at the end of the summaries. This might be because IndoBERT-NLU had been pre-trained on 128 tokens; therefore, it could hardly handle the text summarization task with a 512-token dataset.

muhammad hatta, is not just someone who is respected and proud in his own country. his name is eternal in other countries, including in the netherlands, a country that once colonized indonesia. since he was young, hatta has inhabited the country of the spin wind to continue his education. hatta became a student of handelshoges, the netherlands, and our democracy].
until now hatta hatta has returned to do conferences, hatta has always shown himself as a reliable figure behind the negotiating table.). / ).
Note that, following the generated summaries, spaces were placed around every symbol. Wrong, incomplete, and unneeded words and symbols were highlighted in bold. Rephrased words that had the same meaning as the corresponding reference words were underlined.
Therefore, in the domain of abstractive news summarization, the results indicated that IndoBERT-LEM was preferable to IndoBERT-NLU as it had been pre-trained longer on a news dataset with 512 tokens. Meanwhile, the latter was pre-trained on a more general dataset with 128 tokens, even though that dataset was far larger (3.6B words vs 220M words). This finding is consistent with Lewis et al. (2020), whereby the model that was pre-trained specifically on news data performed better in abstractive news summarization than the model that was pre-trained on more general data.
The next experiment compared the model variations, in which different embedding sizes were set, as shown in Figure 4. The results showed that a larger embedding size could improve the performance of the models. Even the model with IndoBERT-LEM-50kEmb, which was only trained for 5k steps, was on par with the main IndoBERT-LEM model. It should be noted that IndoBERT-NLU was pre-trained with a 50k embedding size while IndoBERT-LEM was pre-trained with a 32k embedding size. Regardless, the models that were fine-tuned with a larger embedding size outperformed the models with a smaller embedding size. Surprisingly, the last model variation, BERTSumAbs with IndoBERT-LEM-GPT, outperformed all other models even though it was only trained for 5k steps. Nevertheless, its fine-tuning was unstable under the same hyperparameters: the model loss suddenly rose and remained high until the last steps. Therefore, the best checkpoint based on dev set loss was used to compute the scores. It was hypothesized that the learning rate might still be too large for the model.
Tinkering with the decoder architecture showed promising results although more research is needed.
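The checkpoint selection described above, in which the checkpoint with the lowest dev set loss is kept when fine-tuning becomes unstable, can be illustrated with a minimal sketch. The function name `best_checkpoint` and all loss values below are hypothetical and are not taken from the experiments; they only mimic a loss curve that rises suddenly and stays high.

```python
def best_checkpoint(dev_losses):
    """Return the (step, loss) pair with the lowest dev set loss.

    dev_losses maps a training step to the dev loss measured at that step.
    """
    return min(dev_losses.items(), key=lambda item: item[1])


# Hypothetical dev losses: the loss jumps after step 3000 and stays high.
dev_losses = {1000: 3.10, 2000: 2.75, 3000: 2.60, 4000: 4.90, 5000: 4.95}

step, loss = best_checkpoint(dev_losses)
print(f"best checkpoint: step {step}, dev loss {loss}")
```

In this illustration, the scores would be computed from the checkpoint saved at step 3000 rather than from the final step.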

Figure 4
Comparison of the Model Variations and Main Models at 5k Steps.
Regarding the use of BERTScore, it yielded higher scores than ROUGE as it computed the similarity between words. However, it was found that the metric was still in line with ROUGE throughout the experiments and provided the same or even less insight than ROUGE. This might be because the generated summaries were mostly extractive. BERTScore should give more insight when the generated summaries are highly abstractive, as the words might differ from the reference summaries but still have similar meaning.
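To illustrate why an exact-match metric gives no credit to a paraphrase, the following is a simplified sketch of unigram overlap in the style of ROUGE-1. It is not the official ROUGE implementation used in the experiments (no stemming or special tokenization), the function name `rouge1` is hypothetical, and the reference phrase is an invented example built around the "spin wind" and "windmills" case discussed earlier.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: precision, recall, and F1 over whitespace unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical phrases: the paraphrase keeps the meaning but changes the words.
reference = "the country of windmills"
paraphrase = "the country of the spin wind"

precision, recall, f1 = rouge1(paraphrase, reference)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here "spin wind" contributes nothing to the overlap even though it carries the same meaning as "windmills". An embedding-based metric such as BERTScore is designed to give partial credit in such cases, which is why it is expected to be more informative on highly abstractive summaries.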

CONCLUSION
This paper presented the results of Indonesian abstractive text summarization using the BERTSum and IndoBERT models. Two IndoBERT checkpoints were used, and the initial findings motivated further experiments on the embedding size and on different decoders. The results showed that in the abstractive summarization task, the IndoBERT model that was pre-trained for more steps on more news data, together with a larger embedding size, achieved higher ROUGE scores. In addition, the model that used a GPT-like decoder achieved higher scores than the regular model that used a standard transformer decoder. This finding suggests that there are other possibilities for improving the BERTSum model, although more research is needed.
For future studies, research in Indonesian abstractive news summarization may utilize the optimal IndoBERT checkpoint and vary the decoder architecture on different datasets to explore other possibilities for achieving higher scores. More research is also needed to examine the effectiveness of the BERTScore metric in abstractive text summarization in order to enable better assessment of text summarization systems.