Twitter dataset on public sentiments towards biodiversity policy in Indonesia

In recent years, biodiversity has emerged as a prominent and pressing topic due to the urgent need to address biodiversity loss and the recognition of its connections to climate change and sustainable development. Additionally, increased public awareness and the consideration of economic factors have further underscored the significance of biodiversity conservation. To investigate the sentiment of the Indonesian people towards biodiversity, we collected data from Twitter using a predefined set of keywords. We amassed a dataset of 500,000 Indonesian tweets from January 2020 to March 2023. These tweets covered a wide range of discussions on biodiversity, including its subdomains of food security, health, and environmental management. Before labeling, a team of 18 experts jointly developed a labeling guide, which served as the reference during labeling. Three annotators labeled each tweet with a sentiment class (positive, negative, neutral), or with the label none for unrelated tweets. The final label was determined by majority voting. Tweets with the final label none and those with an undecided sentiment class were considered invalid and excluded from subsequent processing. After a series of processes, including cleaning (removing duplicates, irrelevant tweets, and tweets written in languages other than Indonesian) and preprocessing, we prepared a dataset containing 13,435 tweets. We measured the inter-annotator agreement level, built several models using different algorithms with K-Fold cross-validation, and evaluated the models. The Fleiss' Kappa value of the dataset, which measures inter-annotator agreement, was 0.62187, and the F1-score of the best model, which used the pre-trained IndoBERT model, was 0.7959. The Fleiss' Kappa value suggests that the annotators had substantial agreement on how to label a tweet, supporting the consistency and reliability of the dataset, while the F1-score suggests that the dataset is suitable for further research on sentiment analysis of biodiversity. This dataset will benefit various lines of research, including topic modeling, sentiment analysis, and public opinion analysis on Twitter, especially for biodiversity-related policies.


Sentiment analysis
Natural language processing
Indonesian
Health
Environmental management
Food security

We gathered a Twitter dataset using around 30 specific biodiversity-related keywords, with dates ranging from January 2020 to March 2023. This data was then refined by filtering out irrelevant information, including non-Indonesian language content, non-biodiversity data, spam, and duplicate entries. Independent analysts undertook the task of manually assigning sentiment labels to the dataset. These eighteen individuals consisted of twelve researchers and engineers specializing in natural language processing, of whom two held Ph.D. degrees, nine had MSc degrees, and one had a BSc degree. Additionally, four lecturers and two experts in natural language processing, each with a Ph.D. or MSc degree, contributed to the labeling process.

Value of the Data
• The dataset was created by collecting information using expert-selected keywords relevant to biodiversity issues widely discussed by the public, covering the subdomains of food security, health, and environmental management. It was labeled by native speakers through a multi-step process, and its consistency and reliability were checked by measuring inter-annotator agreement.
• These data provide insights into public opinion on biodiversity issues on Twitter and can serve as a guide for the development of public policies. The dataset provides validated sentiment and contextual information and enables analysis of public opinion on biodiversity issues, which often encompass diverse viewpoints.
• Government institutions, academics, observers, and communities engaged in sentiment analysis research in the three biodiversity subdomains can use this dataset, both for gathering public opinions on biodiversity-related topics and for developing sentiment classification models in the Indonesian language using various artificial intelligence methods.
• The availability of this dataset supports research on sentiment patterns, public perception, and evidence-based decision-making, offering insights into public perspectives on health, food security, environmental management, and biodiversity policies. This dataset can be used to help identify words that tend to convey negative and positive sentiments towards government policies. With this dataset, decision-makers can study how the public reacts to these policies and assess whether they receive support or criticism.

Data Description
We have released a dataset of tweets on biodiversity, including issues in its subdomains: food security, health, and environmental management. The tweets date from January 2020 to March 2023 and originated from user accounts writing in the Indonesian language. The dataset comprises four files: raw, cleaned-and-selected, sentiment-class labeled, and ready-to-be-trained data.
All datasets are available in the Mendeley Data repository [1].
Data identification number: 10.17632/xtk9wsxjjr.4
Direct URL to data: https://data.mendeley.com/datasets/xtk9wsxjjr/4

a. Raw data (biodiversity_raw.csv)
The raw data was collected based on carefully selected keywords that are often used in issues currently discussed by Indonesians. Along with the tweet ID, each record includes the keyword and the subdomain to which the keyword belongs. The raw data comprises 500,000 tweet IDs, keywords, and each tweet's subdomain. These tweets were collected using the keywords listed in Table 1.

b. Cleaned and selected data (biodiversity_cleaned.csv)
The tweets in this dataset were selected by considering the proportion of tweets per keyword and subdomain. We removed duplicated data, data unrelated to biodiversity, and non-Indonesian data to create a cleaned-and-selected dataset comprising 200,015 tweet IDs, keywords, and the subdomain to which each tweet belongs.

c. Labeled data (biodiversity_labeled.csv)
From the cleaned data, we drew a sample to build a labeled dataset that serves as ground truth, called the labeled data. The sample comprised 15,323 tweets. Labeling was conducted by six distinct groups, with three annotators in each group. Each annotator assigned a sentiment class (positive, negative, neutral) to each tweet, or none for an unrelated tweet. The final label is the majority decision of the three annotators. Consequently, when the three annotators each assign a different label and no majority exists, the tweet is classified as invalid. The degree of agreement among annotators is quantified using the Fleiss' Kappa score, a statistical measure ranging from -1 to 1 that assesses the level of agreement among annotators when evaluating the same entity [2]. A value close to 1 suggests a high level of agreement that exceeds what would be expected by chance alone. Conversely, a value close to 0 indicates agreement no better than what would be expected by chance. Negative values indicate agreement worse than what would be expected by chance.
To describe the combined Fleiss' Kappa scores for Food Security, Environmental Management, and Health, we use the term "overall Fleiss' Kappa score" or "composite Fleiss' Kappa score." This term signifies the combined or aggregate measurement of agreement among multiple annotators for the three subdomains of biodiversity: Food Security, Environmental Management, and Health, as shown in Table 2.
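The majority vote and the agreement measure described above can be illustrated with a minimal sketch. The function names and the tiny example inputs are assumptions made for illustration; this is not the authors' code, and it uses the standard Fleiss' Kappa definition for a fixed number of raters per item.

```python
from collections import Counter

# Label set taken from the text; "invalid" is produced only by the vote.
LABELS = ("positive", "negative", "neutral", "none")


def majority_label(votes):
    """Return the label chosen by at least two of three annotators, else 'invalid'."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "invalid"


def fleiss_kappa(ratings, categories=LABELS):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters.

    ratings: list of per-item label lists, e.g. [["positive", "positive", "neutral"], ...]
    Note: undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_cat = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)
```

Applied to the three annotator columns of the labeled file, `majority_label` would yield the final label per tweet and `fleiss_kappa` the agreement score per subdomain or overall.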
Fig. 1 illustrates the distribution of positive, negative, and neutral labels and two special labels, none and invalid, across the health, food security, and environmental management subdomains. The health subdomain has the highest count of positive tweets (2,566), while food security has the highest count of negative tweets (2,072). Environmental management has the highest count of neutral tweets (1,066), and food security has the highest count of tweets labeled as "none" (751). The health subdomain has the lowest count of invalid tweets (65). These findings describe the sentiment distribution within each category and contribute to a better understanding of sentiment trends in health, food security, and environmental management discussions on Indonesian Twitter.
The labeled dataset contains 15,323 tweet IDs, keywords, the subdomain to which each tweet belongs, the first, second, and third annotator labels, and the final label. For modeling, only tweets with positive, negative, or neutral final labels were used, so all tweets labeled none or invalid were removed. This left 13,435 tweets. The resulting dataset contains tweet IDs, keywords, the subdomain to which each tweet belongs, and the final label.
Fig. 2 presents the sentiment distribution within the health, food security, and environmental management subdomains in the fourth dataset (the ready-to-be-trained data). It shows a prevalent positive sentiment toward health-related topics (2,566 positive tweets), although this category also has substantial numbers of negative tweets (970) and neutral tweets (960). Similarly, positive sentiments are present in the food security and environmental management subdomains, but there are also notable numbers of negative tweets. Overall, the figure gives an overview of the sentiment composition of the dataset, indicating a mix of positive, negative, and neutral sentiments expressed by Indonesian Twitter users regarding health, food security, and environmental management topics. In Table 3, we use word clouds to visualize word frequencies in each sentiment class within each subdomain, offering a more comprehensive depiction of our data.
The figures in Table 3 depict frequently used words within the health, food security, and environmental management subdomains, categorized by positive, negative, and neutral sentiment class. Within health, the term "cegah" (meaning "prevent" in English) holds a prominent position in the positive sentiment category. This can be linked to various government and community initiatives, public guidance and perspectives, and diverse news coverage, all aimed at collectively mitigating the spread of stunting and its associated consequences, such as hindering societal progress. In the negative sentiment class, the words "angka" (numbers), "gizi" (nutrition), and "balita" (toddlers) occur with notable frequency. In food security, three key terms, "pangan" (food), "masyarakat" (society), and "pertanian" (agriculture), prominently represent the positive sentiment category. The most frequent term in the negative sentiment category is "harga," which translates to "price" in English. Within environmental management, the term "PLN" (referring to a government-owned electricity company) prevails in the positive sentiment category. In contrast, the negative sentiment category is predominantly characterized by "banjir" (denoting a flood event).

Experimental Design, Materials and Methods
Fig. 3 shows the process flow of creating and evaluating our dataset. Each process is described below.

a. Raw Data Collection
Before collecting Twitter data, we decided on several keywords in each subdomain to use as search phrases. The keywords were chosen because they were assumed to be part of viral issues in Indonesian communities, so many tweets were expected to contain them. Twitter data was collected using the Twitter API, with dates ranging from January 2020 to March 2023. The total number of collected tweets was 500,000.

b. Data Cleaning
We found a lot of data in the raw dataset that could not be used, such as duplicated data, data unrelated to biodiversity (including the three subdomains), and data not written in Indonesian. We removed those data from the raw dataset, leaving 200,015 tweets.

c. Data Selection
After cleaning the data, we sampled tweets according to the proportion of collected tweets. Sampling was carried out with a 99% confidence level and a 1% confidence interval (margin of error), giving a total of 15,323 tweets across the three subdomains. Each subdomain contributed approximately 5,000 tweets, an equal share intended to ensure that the dataset represents the biodiversity topics discussed.
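To show how a 99% confidence level and a 1% margin translate into a sample size, the sketch below uses Cochran's formula with a finite-population correction. This choice of formula, the worst-case proportion p = 0.5, and the function name are assumptions; the text does not state which calculator or formula was used, so the result should only be expected to land near the reported 15,323, not to match it exactly.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.01, p=0.5):
    """Cochran's sample size with finite-population correction (assumed method)."""
    # Two-sided z-score for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinite population
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Correct for the finite population of cleaned tweets
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Population taken from the cleaned-and-selected dataset described in the text.
n_cleaned = sample_size(200_015)
```

With a population of 200,015 the formula gives a value of roughly fifteen thousand, which is consistent in magnitude with the sample reported above; differences of a few tweets can arise from rounding of the z-score or from a different correction variant.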

d. Data Labeling
Before labeling each tweet in the raw data, a team of 18 annotators created a labeling guidance document [3]. The document included detailed instructions on annotating a tweet with a sentiment class label (positive, negative, neutral) or a "none" label, along with some examples. In this labeling phase, the "none" label was introduced to categorize tweets that are unrelated to biodiversity. This was necessary because, during the preceding data cleaning stage, many tweets were not filtered out due to the presence of keywords, even though they were not related to the biodiversity topic. These tweets included advertisements, recipes, and excerpts from short stories, novels, or dialogues that contained search keywords either in their content or in their hashtags. Hence, human annotation was required to mark them with the "none" label. Health, environmental management, and food security issues on Twitter are generally of a general nature and, in practice, can be understood by every labeler, enabling them to provide reasonable judgments.
Referring to the labeling guidance document, each annotator labeled a sample set of 800 tweets from each subdomain. Each tweet was labeled by three annotators. The results were then verified by majority voting, and the level of inter-annotator agreement was measured. The labeling guidance document was revised according to issues found during the verification process. These steps were repeated until all problems were resolved and the labeling guidance document was considered ready to be used as a reference in the labeling process. In our labeling guidelines, the labeling of news tweets is based on several previous studies [4][5][6]. These studies indicate that news not only presents factual information but also conveys various emotional states, including empathy, joy, apprehension, and rage. They emphasize the importance of headlines, because news typically carries high emotional content, describes major national or international events, and is written in a style intended to capture readers' attention. They also suggest that headlines can determine how many people read the news and can influence how users comprehend the related content and shape their perspective.
Table 4 below shows the characteristics of each sentiment class label defined in the labeling guidance document.

e. Data Preprocessing
We conducted a series of preprocessing phases [7] to generate a dataset (biodiversity_for_modelling.csv) for sentiment classification modeling. The initial stage converted all text to lowercase to ensure uniformity and eliminate discrepancies stemming from capitalization. Subsequently, URLs were removed, punctuation was replaced with spaces (except for apostrophes), and non-ASCII characters were replaced with their nearest ASCII equivalents. After that, we normalized slang and typos to standardize word variations and reduce their influence. Then, numerical values and remaining non-ASCII characters were removed, and runs of consecutive whitespace were condensed to a single space. Finally, stopword removal was applied using the Sastrawi library [8], except for adverbial words, which were kept in the text because of their significance and influence on sentence sentiment. These preprocessing steps resulted in the ready-to-be-trained dataset of 13,435 tweets.
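The preprocessing steps above can be sketched with the Python standard library alone. The slang lexicon, stopword list, and retained-word set below are small hypothetical stand-ins (the text uses the Sastrawi stopword list and does not publish its slang dictionary here), and the exact order of ASCII folding and punctuation handling is an assumption.

```python
import re
import unicodedata

# Hypothetical toy resources for illustration only.
SLANG = {"gak": "tidak", "yg": "yang", "udah": "sudah"}
STOPWORDS = {"yang", "di", "dan", "ini", "itu", "tidak"}
KEEP = {"tidak", "sangat"}  # adverbial words retained despite being stopwords


def preprocess(text):
    """Apply lowercase, URL removal, ASCII folding, and stopword filtering."""
    text = text.lower()
    # Remove URLs before punctuation handling breaks them apart
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Fold accented characters to ASCII and drop what cannot be folded
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace punctuation (and underscores) with spaces, keeping apostrophes
    text = re.sub(r"[^\w\s']|_", " ", text)
    # Normalize slang and typos token by token
    text = " ".join(SLANG.get(tok, tok) for tok in text.split())
    # Remove numerical values
    text = re.sub(r"\d+", " ", text)
    # Drop stopwords except retained words; split/join also collapses whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS or tok in KEEP]
    return " ".join(tokens)
```

For example, a tweet such as "Harga PANGAN naik 20%!! https://t.co/abc yg gak wajar" is reduced to "harga pangan naik tidak wajar" by this sketch, with the negation kept because it carries sentiment.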

f. Sentiment Analysis Model Creation and Evaluation
We used two scenarios for creating sentiment analysis models. In the first scenario, we used the IndoBERT Tweet pre-trained model [9]. Two fully connected layers took the sentence embeddings produced by the IndoBERT Tweet pre-trained model as input to generate the sentence's sentiment class. We used three different configurations of the activation function for the first fully connected layer: no activation, Gaussian Error Linear Unit (GELU), and norm hyperbolic tangent (norm tanh). Among these, the configuration with the GELU activation function performed best, achieving a best accuracy of 0.8263 and a best F1-score of 0.7959. We also ran the same setup on the preprocessed text dataset and obtained accuracies ranging from 0.7765 to 0.7914 and F1-scores ranging from 0.7441 to 0.754.
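The classification head in the GELU configuration can be written compactly as follows. The symbols are assumptions introduced for illustration: e is the sentence embedding from the pre-trained model, W_1, b_1 and W_2, b_2 are the weights and biases of the two fully connected layers, and Φ is the standard normal cumulative distribution function. The final softmax is also an assumption, since the text does not state how the second layer's outputs are converted into a class.

```latex
\begin{aligned}
\mathbf{h} &= \operatorname{GELU}(W_1 \mathbf{e} + \mathbf{b}_1),
  \qquad \operatorname{GELU}(x) = x\,\Phi(x),\\
\hat{\mathbf{y}} &= \operatorname{softmax}(W_2 \mathbf{h} + \mathbf{b}_2),
  \qquad \hat{\mathbf{y}} \in \mathbb{R}^{3}.
\end{aligned}
```

In the no-activation configuration, the first line reduces to h = W_1 e + b_1, so the two layers compose into a single affine map.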
In the second scenario, we used the Term Frequency-Inverse Document Frequency (TF-IDF) method to extract features from the labeled dataset, which were then used in various traditional machine learning algorithms: Logistic Regression [10], Support Vector Classifier (SVC) [11], Light Gradient Boosting Machine (LGBM) [12], Random Forest [13], Extreme Gradient Boosting (XGB) [14], AdaBoost [15], and Decision Tree [16]. The Logistic Regression and SVC classifiers achieved the highest accuracies of 0.72115 and 0.71967, respectively, while the LGBM, Random Forest, and XGB classifiers exhibited slightly lower accuracies. However, it is important to consider both accuracy and the F1-score, which combines precision and recall. The Logistic Regression and SVC classifiers also achieved the highest F1-scores of 0.70619 and 0.70230, respectively.
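Since the comparison relies on both accuracy and the F1-score, the sketch below shows how the two can be computed for the three sentiment classes. Macro averaging is an assumption; the text does not state which F1 averaging scheme was used, and the function names are illustrative.

```python
SENTIMENTS = ("positive", "negative", "neutral")


def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted label equals the final label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=SENTIMENTS):
    """Unweighted mean of per-class F1 (harmonic mean of precision and recall)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because the classes are imbalanced (positive tweets dominate in the health subdomain, for instance), a macro-averaged F1 can sit well below accuracy, which is one reason to report both.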
Table 5 summarizes the performance of all classifiers. The best model is the one created using the IndoBERT Tweet pre-trained model with the GELU activation function. These results indicate which combinations of text representations and classifiers are most effective for sentiment analysis in biodiversity, offering guidance for future research in this domain.

Limitations
The published dataset is limited to Indonesian Twitter data on the theme of biodiversity, encompassing health, environmental management, and food security issues. Therefore, preliminary testing is necessary before applying our dataset to Indonesian Twitter data with different themes.

Table 2
Scores of inter-annotator agreement level (labeled dataset).

Table 3
Word Clouds of Sentiment Classes in Each Subdomain.

Table 4
Characteristics of sentiment class label as defined in the labeling guide document.

"Hal ini tidak lepas dari cetak biru BRIvolution 2.0 yang diterapkan sejak awal pandemi covid 19" ("This cannot be separated from the BRIvolution 2.0 blueprint implemented since the beginning of the covid 19 pandemic.")
"Direktur Informasi dan Komunikasi Pembangunan Manusia dan Kebudayaan, Dirjen Informasi dan Komunikasi Publik Kemenkominfo Wiryanta mengatakan, bonus demografi menjadi perhatian utama pemerintah." ("Director of Information and Communication for Human Development and Culture, Director General of Information and Public Communication of the Ministry of Communication and Informatics Wiryanta said that the demographic bonus is the government's primary concern.")

None
• Not related to biodiversity, such as containing product advertisements, food recipes, etc.
• Containing non-Indonesian words
• Containing absurd jokes or joking terms
"Aku lagi nyoba jualan skincare & obat" herbal, udah lama sih jadi member, tapi gak diseriusin soalnya masih sibuk ditoko alat tulis sekolah, sekarang udah gak ada yg kesekolah baru deh kepikiran buat serius" ("I'm trying to sell skincare & herbal medicine, I've been a member for a long time, but I didn't take it seriously because I'm still busy at the school stationery shop, and now no one's going to school. I'm thinking about getting serious.")
"06 Jan 2023 Status Terkini COVID-19 Malaysia Maklumat lanjut sila layari" ("06 Jan 2023 Current Status of COVID-19 Malaysia For further information, please review")