Indonesian News Text Summarization Using MBART Algorithm

Purpose: Technological advances have led to the production of large amounts of textual data. Textual information is available from many sources, including blogs, news portals, and websites. Kompas, BBC, Liputan 6, and CNN are among the news portals that publish news in Indonesian. The purpose of this study was to explore the effectiveness of mBART for text summarization in Bahasa Indonesia. Methods: This study fine-tunes mBART, a transformer-based model, to generate summaries of news articles in Bahasa Indonesia. The quality of the generated summaries was evaluated using the ROUGE metric. Results: The fine-tuned model achieved a ROUGE-1 of 35.94, a ROUGE-2 of 16.43, and a ROUGE-L of 29.91. However, its performance is still not optimal compared to existing text summarization models for other languages. Novelty: The novelty of this research lies in the use of mBART for text summarization, specifically adapted for Bahasa Indonesia. In addition, the findings contribute to understanding the challenges and opportunities of improving text summarization techniques in the Indonesian context.


INTRODUCTION
Current technological changes have left online users facing information overload: the amount of textual information on websites is growing so rapidly that it is difficult to read through it all [1]. Web sources such as blogs, social media networks, and news sites are a huge source of textual data [2]. Many websites provide news in Indonesian, such as Kompas, BBC, Liputan 6, and CNN, and these media produce news and articles every day [3]. More and more online documents therefore require summarization to help users understand the information they contain. Summarizing online documents saves users the time they would otherwise spend looking for the information they need [4].
Text summarization is the process of condensing a long text into a short one while preserving its main idea. In natural language processing (NLP) and information retrieval, automatic text summarization is one of the fundamental tasks [5]. It can be applied in various industrial settings, such as news, aggregators, blogs, and product descriptions [6]. Summaries can make it easier for search engines to find content than searching the full text. Digital businesses such as e-commerce can also use text summarization to display a brief description of a product, and journalists can use it to produce news headlines [7].
Text summarization can be classified into three categories: extractive, abstractive, and hybrid. Extractive summarization finds the important parts of the content and forms the summary from a subset of the sentences in the original document [8]. It does not add words to the existing content and cannot merge two or more sentences into one. It works by selecting and combining words or phrases from the source text to form the summary [6], [9], [10]. Hybrid summarization combines the extractive and abstractive approaches. This method has the drawback of producing lower-quality abstractive summaries than the pure abstractive approach [2], [11], [12].
Abstractive summarization works by understanding the given text and generating relevant summary sentences of its own. It is also more flexible in generating summaries [13], [14], [15]. Unlike extractive summarization, which can produce poorly formed sentences, abstractive summarization can produce grammatically correct sentences [6], [16], [17]. The abstractive method paraphrases and rearranges sentences into a summary [18], [19]. Because of these advantages, this research uses the abstractive summarization method.
Various algorithms can be used for abstractive summarization, one of which is the multilingual version of the BART algorithm. BART is a pre-trained system based on the transformer architecture [20], [21]. Its multilingual version is known as mBART, and Indonesian is one of the languages it can process. The use of mBART in text summarization can produce a good model. Research on mBART has been conducted in several languages, such as Russian [22], [23], Vietnamese [24], [25], [26], and various others. Some of these studies report good evaluation results. For example, research on Vietnamese [24], [26], [27] obtained a ROUGE-1 of 55.21, a ROUGE-2 of 25.69, and a ROUGE-L of 37.33 on the WikiLingua dataset, and a ROUGE-1 of 59.81, a ROUGE-2 of 28.28, and a ROUGE-L of 38.71 on the Vietnews dataset. This study investigates the use of mBART for summarizing Indonesian news text.
The rest of this paper is organized as follows: Section 2 describes the research method for applying the mBART algorithm to the summarization of Indonesian news text. Section 3 contains the results and discussion, followed by the conclusion in Section 4.

Proposed method
The proposed research method for text summarization using mBART is described in Figure 1 and covers the steps of pre-processing, fine-tuning, training, evaluation, and summary prediction. It starts with pre-processing the XL-SUM dataset, obtained from Hugging Face: https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/indonesian [28], [3]. This step ensures that the data is ready for the subsequent steps. The dataset is retrieved and split in Python on Google Colab. The mBART model is then fine-tuned, and training is carried out to optimize model performance. Evaluation with the ROUGE metric assesses the quality of each generated summary against the reference summary. The ability of the model to produce short and precise summaries is thus tested before entering the summary prediction stage.

Literature study
The first step of the research is to collect literature studies and prior research that address similar problems or topics. Scientific articles, journals, and books are among the literature sources that can be used. The literature was selected on the basis of sharing the same problem, namely text summarization. The type of text summarization chosen is the abstractive method using mBART.

Dataset text summarization
A text summarization dataset can be collected in several ways. The first is to scrape websites and then write the summaries manually. This is very ineffective because building the dataset takes a long time. The second is to ask the authors of similar text summarization research for permission to use their dataset. The third is to use a public dataset, which can be obtained from open dataset websites such as Kaggle and the UCI Machine Learning Repository. This study uses the third method, with the public dataset XL-SUM, which covers 44 languages, one of which is Indonesian [3]. XL-SUM is a large and diverse dataset comprising 1 million professionally annotated article-summary pairs extracted from the BBC using a set of carefully designed heuristics. Only the Indonesian part of XL-SUM is used. XL-SUM was selected because it is easy to use and access through the library on the HuggingFace website.

Evaluation of results
In this study, results were evaluated automatically using ROUGE, in accordance with previous research on the same topic. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a package of several automatic evaluation methods that calculate the similarity between summaries. There are four types of ROUGE calculation: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S [29]. The types used in this research are ROUGE-N (ROUGE-1, ROUGE-2) and ROUGE-L. The ROUGE results are used to evaluate the model and to compare it with previous research.
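For reference, the ROUGE variants used here can be written out as follows. This is the usual formulation of ROUGE rather than anything specific to this study. The notation is introduced only for this sketch: X is a reference summary of length m, Y is a candidate summary of length n, g_n is an n-gram, Count_match is the number of n-grams co-occurring in the candidate and the reference, and β weights recall against precision.

```latex
\begin{align}
\text{ROUGE-N} &= \frac{\sum_{S \in \mathrm{Refs}} \sum_{g_n \in S} \mathrm{Count}_{\mathrm{match}}(g_n)}
                       {\sum_{S \in \mathrm{Refs}} \sum_{g_n \in S} \mathrm{Count}(g_n)} \\
R_{\mathrm{lcs}} &= \frac{\mathrm{LCS}(X, Y)}{m}, \qquad
P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{n} \\
\text{ROUGE-L} &= \frac{(1 + \beta^{2})\, R_{\mathrm{lcs}}\, P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^{2} P_{\mathrm{lcs}}}
\end{align}
```

With β = 1 the ROUGE-L score reduces to the harmonic mean (F1) of the LCS-based precision and recall.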

RESULTS AND DISCUSSIONS

Dataset preparation
The dataset used in this research consists of Indonesian news articles taken from XL-SUM [3]. XL-SUM covers 44 languages, one of which is Indonesian, and is already divided into train, test, and validation splits. The Indonesian part of XL-SUM contains 47,800 examples, of which 38,200 are for training and 4,780 for testing and validation. The articles come from BBC News. Each example contains an ID, a URL, a title, a summary, and the article text, as shown in Table 1.

Pre-processing
Before the data can be given to the model for training, it is pre-processed so that the model can learn from it. Pre-processing consists of tokenization, followed by loading the model and the data collator.

Tokenizer
The tokenizer breaks text down into tokens according to a defined set of rules. Commonly used examples are word tokenization and sentence tokenization [27]. Because this study uses a pre-trained model, pre-processing must use the tokenizer associated with that model. The pre-trained model is mBART50, so the mBART50 tokenizer is used [30]. This ensures that the text is split in the same way as the pre-training corpus and that the same vocabulary is used as at pre-training time. Tokenization in mBART50 is based on SentencePiece [31]. SentencePiece can train sub-word models directly from raw sentences [32]. During tokenization, the input is limited to a maximum of 1024 tokens. Table 2 shows the results of the tokenization process. In Table 2, tokenization of the text produces three outputs: 'input_ids', 'attention_mask', and 'labels'. 'input_ids' comes from the 'text' field of the dataset, while 'labels' comes from the 'summary' field. 'attention_mask' is an optional argument that is used when sequences are batched together. It takes one of two values: 1 marks tokens that should be attended to, and 0 marks tokens that should not. The numbers in 'input_ids' and 'labels' are the token IDs obtained by breaking the input sentence into tokens so that they can be processed at a later stage.
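The mapping from the 'text' and 'summary' fields to 'input_ids', 'attention_mask', and 'labels' can be sketched as below. This is a minimal illustration, not the study's actual code: `preprocess`, `ToyTokenizer`, and the target length of 128 are hypothetical names and values introduced here, and the toy tokenizer merely splits on whitespace so that the example runs without downloading a model. Only the 1024-token input limit comes from the text.

```python
def preprocess(batch, tokenizer, max_input_length=1024, max_target_length=128):
    """Tokenize articles and summaries into model inputs and labels.

    max_input_length=1024 follows the paper; max_target_length is an
    assumed value, since the paper does not state one.
    """
    model_inputs = tokenizer(
        batch["text"], max_length=max_input_length, truncation=True
    )
    targets = tokenizer(
        text_target=batch["summary"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs


class ToyTokenizer:
    """Hypothetical stand-in for a real sub-word tokenizer (whitespace split)."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text=None, text_target=None, max_length=None, truncation=False):
        texts = text if text is not None else text_target
        all_ids = []
        for sentence in texts:
            # Assign each new word the next free id (ids 0-3 left for specials).
            ids = [self.vocab.setdefault(w, len(self.vocab) + 4) for w in sentence.split()]
            if truncation and max_length is not None:
                ids = ids[:max_length]
            all_ids.append(ids)
        return {
            "input_ids": all_ids,
            "attention_mask": [[1] * len(ids) for ids in all_ids],
        }


example = {"text": ["a b c d e"], "summary": ["a c"]}
encoded = preprocess(example, ToyTokenizer(), max_input_length=3)
print(encoded)
```

With the Hugging Face libraries, the same `preprocess` function could be given the tokenizer that ships with mbart-large-50 and applied to the dataset in batches; the exact calls used in the study are not stated in the text.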

Load model and data collator
Before fine-tuning, the model is downloaded from the HuggingFace website. mBART is available in several versions, such as mbart-large-cc25 and mbart-large-50. This research uses mbart-large-50, which supports Indonesian. The downloaded model is 2.44 GB. In addition, padding is carried out using DataCollatorForSeq2Seq. This makes padding efficient because it is done dynamically: each batch is padded only to the length of its longest sequence. Padding is needed so that the previously tokenized sequences all have the same length before they are fed into the model.
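The idea of dynamic padding can be sketched in plain Python as follows. This is a simplified illustration of what a sequence-to-sequence data collator does, not the library's implementation: `pad_batch` is a hypothetical helper, the pad token id of 1 is an assumption for the example, and -100 is the label padding value that DataCollatorForSeq2Seq uses by default so that padded label positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Pad every sequence in a batch to the longest one in that batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_input - len(f["input_ids"])
        n_lab = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # 1 = real token to attend to, 0 = padding to ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


# Two toy examples of different lengths (token ids are made up).
features = [
    {"input_ids": [5, 6, 7, 2], "labels": [8, 2]},
    {"input_ids": [9, 2], "labels": [10, 11, 2]},
]
batch = pad_batch(features, pad_token_id=1)
print(batch)
```

Because the target length is computed per batch, a batch of short articles is padded far less than it would be with a fixed length of 1024 tokens.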

Fine-tuning model
The model used is a pre-trained model based on the Transformer architecture. A pre-trained model is one that has been trained in advance on other datasets. Fine-tuning is a technique for adapting such a model to a new dataset. mBART50 is one such pre-trained model, so it is fine-tuned here [33], [34].
Fine-tuning does not change the model architecture; it only adapts the model to the dataset used, namely the Indonesian part of XL-SUM. The hyperparameters are batch_size = 4, learning_rate = 2e-5, optimization = "adamw_torch", weight_decay = 0.01, save_total_limit = 3, and num_train_epochs = 1. These hyperparameters are used in the model training that follows.
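As a rough sketch, the hyperparameters listed above could be collected as follows, together with the number of optimizer steps they imply for one epoch. The mapping of batch_size to `per_device_train_batch_size` and of optimization to `optim` assumes the Hugging Face `Seq2SeqTrainingArguments` class on a single GPU; the text does not say which API was used, and `build_training_args` and `optimizer_steps` are hypothetical helpers. The training-set size of 38,200 is the approximate figure reported in the paper.

```python
import math

# Values taken from the text; key names are an assumed mapping onto
# Hugging Face training arguments.
HYPERPARAMS = {
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-5,
    "optim": "adamw_torch",
    "weight_decay": 0.01,
    "save_total_limit": 3,
    "num_train_epochs": 1,
}


def optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def build_training_args(output_dir):
    """Build training arguments; imports transformers only when called."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **HYPERPARAMS)


TRAIN_EXAMPLES = 38_200  # approximate size of the Indonesian training split
total_steps = optimizer_steps(
    TRAIN_EXAMPLES,
    HYPERPARAMS["per_device_train_batch_size"],
    HYPERPARAMS["num_train_epochs"],
)
print(total_steps)
```

Under these assumptions a single epoch corresponds to roughly 9,550 parameter updates, which gives a sense of how little training the model receives with num_train_epochs = 1.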

Training model
After pre-processing the dataset and selecting the model hyperparameters, the next step is training. The method used in this study is a pre-trained model based on the transformer architecture, namely mBART50. mBART50 is an extension of mBART25, so the two share the same model architecture [35]. The mBART25 architecture is a sequence-to-sequence transformer with 12 layers each in the encoder and the decoder, a model dimension of 1024, and 16 attention heads (~680 million parameters). In addition, mBART25 has an additional normalization layer on top of both the encoder and the decoder [36], [37]. The difference between the two is that mBART50 extends the embedding layer with randomly initialized vectors for an extra set of 25 new language tokens [33].
mBART50 is a multilingual version of BART. When BART is fine-tuned for summarization, information is copied from the input but manipulated, which is closely related to the denoising pre-training objective. In this setting, the encoder input is the input sequence, and the decoder generates the output autoregressively [20]. Training uses PyTorch. The training split contains about 38,200 articles and the validation split 4,780 articles. Training takes 1 hour and 28 minutes for 1 epoch on an A100 GPU with 40 GB of GPU memory on Google Colab Pro. The resulting loss is shown in Table 3, and training runs well. The trained model is then saved to Google Drive so that it can be reused when summarizing new text.

Evaluation and comparison
The model is evaluated using ROUGE. ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used in automatic text summarization evaluation [38]. It consists of several automatic evaluation methods that measure the similarity between summaries [29]. In this study, the variants used are ROUGE-N (ROUGE-1 and ROUGE-2) and ROUGE-L. ROUGE-N counts the n-grams that match between the text generated by the model and the reference. An n-gram is a sequence of n consecutive tokens or words [39]: a unigram consists of one word, and a bigram consists of two consecutive words. ROUGE-L is instead computed from the longest common subsequence (LCS) between the model output and the reference [40], [41]. The ROUGE scores obtained with mBART50 are given in Table 4.
Table 4. Fine-tuning results of mBART50
Before this, one of the hyperparameters, learning_rate, was tested to find the best setting. Three learning rates were tried: 1e-4, 1e-5, and 2e-5. The results (Table 5) differ only slightly from one learning rate to another; the best results, obtained with a learning rate of 2e-5, are used for the comparisons with other models and benchmarks.
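The n-gram matching behind ROUGE-N and the LCS computation behind ROUGE-L can be sketched in plain Python as below. This is a simplified illustration with whitespace tokenization and a single reference; it is not the evaluation package used in the study, which may apply additional normalization. The function names and the example sentences are introduced here for illustration. Scores are returned on a 0 to 1 scale, whereas the paper reports them multiplied by 100.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(match, cand_total, ref_total):
    """Precision, recall, and F1 from a match count and two totals."""
    precision = match / cand_total if cand_total else 0.0
    recall = match / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_n(candidate, reference, n=1):
    """ROUGE-N: overlap of n-grams, with clipped counts."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    match = sum((cand & ref).values())  # min count per shared n-gram
    return prf(match, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: precision, recall, and F1 based on the LCS length."""
    c, r = candidate.split(), reference.split()
    return prf(lcs_length(c, r), len(c), len(r))


candidate = "jalan tol baru diresmikan presiden"
reference = "presiden meresmikan jalan tol baru"
r1 = rouge_n(candidate, reference, 1)
r2 = rouge_n(candidate, reference, 2)
rl = rouge_l(candidate, reference)
print(r1["f1"], r2["f1"], rl["f1"])
```

In this example four of five unigrams match, two of four bigrams match, and the longest common subsequence is "jalan tol baru", so ROUGE-L is lower than ROUGE-1 because the word "presiden" appears in a different position in the two sentences.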
In this research, experiments were also conducted with a similar algorithm, mT5. mT5 is a multilingual model with an encoder-decoder architecture based on T5. It comes in five versions: small (≈ 300 million parameters), base (580 million), large (1.2 billion), XL (3.7 billion), and XXL (13 billion) [42]. The version used for this comparison is mT5 base [3], [43]. mT5 base was chosen because its size before fine-tuning, 2.33 GB, is close to that of mBART50. From Table 6, it can be concluded that mBART50 gives better results than mT5 base, with a difference of 11.4272 in ROUGE-1, 7.5777 in ROUGE-2, and 9.319 in ROUGE-L, even though the model sizes are almost the same. In addition, a comparison is made with benchmark results from research using similar algorithms. In Table 7, the code AR refers to Arabic, IT to Italian, VI to Vietnamese, RU to Russian, and ID to Indonesian. Table 7 also shows that the ROUGE-1, ROUGE-2, and ROUGE-L scores obtained in this study are still relatively low and do not exceed those of some other studies. Among the benchmarks, AR XL-S was trained using the mBART25 model and achieved ROUGE scores of R1 = 32.1, R2 = 12.5, and RL = 27.6. When trained with the XL-T dataset, however, the scores were R1 = 29.8, R2 = 11.7, and RL = 26.9 [43]. Likewise, IT MLSum-It achieved ROUGE scores of R1 = 19.3, R2 = 6.4, and RL = 16.3 using the mBART model [44]. Another study that tested the mBART model on the RU Gazeta dataset (Russian) obtained R1 = 32.1, R2 = 14.2, and RL = 27.9 [22]. Our evaluation of the AR models with the XL-S and XL-T datasets, the IT model with the MLSum-It dataset, and the RU model with the Gazeta dataset shows that each trained model faces problems caused by several factors. These factors include the pre-processing stage, which uses only a tokenizer and a data collator for padding, and the use of only one epoch for training. These constraints stem from resource limitations, such as limited GPU usage on Google Colab Pro, and from the size of the datasets, which affect the final ROUGE score of the model.
We evaluate and compare prediction models that have been trained and tested to summarize news articles in Bahasa Indonesia using Google Colab Pro. Our proposed model, mBART50, was trained on the XL-Sum ID dataset, a news dataset collected from BBC News Bahasa Indonesia. During training, we found that the evaluation of the mBART50 model depends heavily on the data used for training, especially on the pre-processing stage that cleans up noise and irrelevant words in the documents. After fine-tuning, the mBART50 model can summarize documents well without losing their original meaning, as shown by the ROUGE scores of R1 = 35.9, R2 = 16.4, and RL = 29.9. However, we also recognize that the quality of the summaries produced by the mBART model depends on the characteristics of the dataset and the language being summarized, which may affect overall summary quality.
The model proposed in this study, mBART50, shows significant results in summarizing Indonesian text, as evidenced by ROUGE scores (R1 = 35.9, R2 = 16.4, RL = 29.9) that are better than those of other models evaluated in previous studies. The model was trained on the XL-Sum ID dataset, which consists of filtered Indonesian news from BBC News, and it handles text in a way that is specific to the Indonesian context. Pre-processing that filters out noise and irrelevant words in the documents adds to the quality of the summaries produced. The use of mBART as a multilingual transformer model represents the novelty of this research, while the ROUGE-based evaluation provides high confidence in the validity of the results. Thus, the mBART50 model not only contributes to improving the quality of Indonesian text summarization but also brings novelty by adopting a better approach to the use of language-specific technologies in the text summarization domain.

CONCLUSION
This research explored the use of mBART for Indonesian text summarization and made progress in the development of text summarization models. Evaluation with ROUGE gives a ROUGE-1 of 35.94, a ROUGE-2 of 16.43, and a ROUGE-L of 29.91. Nevertheless, the performance of the model is still not optimal and has not outperformed existing models. Challenges such as performance improvement and resource efficiency remain an important focus for future research. This research thus contributes to the development of Indonesian text summarization and highlights the need to improve the quality and effectiveness of future text summarization methods.

Figure 1. Flowchart of research flow

Data collection method
Data type
The type of data required in this research is qualitative data. Qualitative data is non-numeric data. Text, sentences, and words are examples of qualitative data and are needed as datasets for model building. The model created is a text summarization model for Indonesian news. Therefore, the data needed is in the form of news articles in Indonesian.

Table 3. Results in loss

Table 5. Comparison results with learning rate

Table 6. Comparison between mBART50 and mT5 base

Table 7. Comparison of results with benchmark

Table 8. Text summary experiment