MABSA: A curated Malayalam aspect based sentiment analysis dataset on movie reviews

Regional languages are being used more frequently on online platforms as a result of the expanding use of digital technology. Understanding user opinions on social media platforms, forums, blogs, and other digital platforms that employ Indian regional languages has become significant due to their role in various applications. Research on sentiment analysis of Indian regional language texts suffers from the unavailability of regional language datasets. The curated Malayalam Aspect Based Sentiment Analysis (MABSA) dataset is a labeled dataset for Aspect Based Sentiment Analysis (ABSA) on the Indian regional language Malayalam in the movie review domain. Malayalam movie reviews, an excellent source of text data for ABSA, were collected through an online survey using a Google form and by manually collecting reviews from three online platforms: IMDb, Facebook, and YouTube. Nine target aspects were identified, and three annotators annotated the dataset based on the sentiment polarity of each aspect. A total of 4000 reviews were collected, and a total of 7507 aspects were identified in the reviews. Spearman's correlation and the Fleiss Kappa Index were used to analyze the annotators' agreement on the labeled dataset. The high correlation between the annotators implies that the MABSA dataset is a gold-standard resource.


Value of the Data
• The MABSA dataset contains 4000 movie reviews in Malayalam and a total of 7507 aspects that are manually labeled based on sentiment polarity. To our knowledge, this is the first publicly available dataset on Malayalam sentiment analysis. Using the MABSA dataset, researchers can create new algorithms and models for sentiment analysis in the entertainment sector based on Malayalam movie reviews. This may result in fresh perspectives and advancements in natural language processing for Malayalam.
• Malayalam is a low-resource Indian regional language [1]. Hence this dataset will benefit regional languages by supporting the development of machine learning and deep learning models for different applications like recommender systems, next-word prediction, etc.
• Malayalam is a morphologically rich Indian language spoken by 38 million people worldwide [2]. A dataset for ABSA can shed light on local sentiment by analyzing sentiment conveyed in Malayalam text, which may differ from sentiment conveyed in other languages like English.

Objective
The primary objective of creating the MABSA dataset is to develop a high-quality, labeled dataset that properly depicts the attitudes expressed towards particular aspects or features of movies described in the Malayalam language. Since Malayalam is a low-resource language, the contribution of the MABSA dataset will be a huge asset to the research community.

Data Description
ABSA often relies on movie reviews as a source of data since they tend to offer detailed comments on diverse features or aspects of a film. The MABSA dataset for movie reviews aims to pinpoint the individual aspects of the film that are expressed within a review comment in the Malayalam language and to evaluate the corresponding sentiment orientation of every aspect present in the text. Sentiment orientations are mapped to positive, negative, or neutral depending on the words or phrases that describe each aspect. The description of each column in the MABSA dataset is presented in Table 1. One column contains the approximate translation of the reviews into English; each of the aspect columns contains the sentiment label for one aspect of the movie:
• Direction: sentiment label for the 'direction' aspect.
• Songs: sentiment label for the 'songs' aspect.
• Acting: sentiment label for the 'acting' aspect.
• Screenplay: sentiment label for the 'screenplay' aspect.
• Story: sentiment label for the 'story' aspect.
• Casting: sentiment label for the 'casting' aspect.
• Editing: sentiment label for the 'editing' aspect.
• Cinematography: sentiment label for the 'cinematography' aspect.
• BGM: sentiment label for the 'background music' aspect.
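The column layout and the 1 / 0 / 0.5 label encoding can be sketched in Python. The CSV snippet below, the column subset, and the assumption that an empty cell means the aspect is absent from the review are illustrative, not part of the released dataset.

```python
import csv
import io

# Aspect columns named in Table 1; label encoding from the paper:
# 1 = positive, 0 = negative, 0.5 = neutral.
ASPECTS = ["Direction", "Songs", "Acting", "Screenplay", "Story",
           "Casting", "Editing", "Cinematography", "BGM"]
LABELS = {1.0: "positive", 0.0: "negative", 0.5: "neutral"}

# A tiny in-memory stand-in for the released CSV (illustrative values).
sample_csv = "Direction,Acting,Songs\n1,0.5,0\n"

for row in csv.DictReader(io.StringIO(sample_csv)):
    # Decode only the cells that carry a label (empty cell = aspect absent).
    decoded = {aspect: LABELS[float(value)]
               for aspect, value in row.items() if value}
    print(decoded)  # {'Direction': 'positive', 'Acting': 'neutral', 'Songs': 'negative'}
```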

Experimental Design, Materials and Methods
The workflow for MABSA dataset creation is depicted in Fig. 1. Movie reviews in Malayalam were collected in two ways. The first was through a Google form, asking people to write reviews about 20 movies. The second was from various groups and pages on the social media platforms Facebook and YouTube and from the IMDb website. Reviews were collected based on the presence of at least one target aspect in the review, where the target aspects are direction, songs, acting, screenplay, story, casting, editing, cinematography, and background music. These nine target aspects were short-listed from a set of aspects related to the film industry that people predominantly use when writing a review comment. A total of 4000 reviews of various movies were collected and stored in an Excel sheet.
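The inclusion criterion above (keep a review only if it mentions at least one target aspect) can be sketched as a keyword filter. The English keyword lists and sample reviews below are illustrative stand-ins; the actual screening was done manually over Malayalam aspect terms and their synonyms.

```python
# Illustrative keyword lists per target aspect (stand-ins for the
# Malayalam terms and synonyms the annotators actually looked for).
ASPECT_KEYWORDS = {
    "direction": ["direction", "director"],
    "songs": ["song", "songs"],
    "acting": ["acting", "performance"],
}

def mentions_target_aspect(review: str) -> bool:
    """Return True if the review mentions at least one target aspect."""
    text = review.lower()
    return any(kw in text for kws in ASPECT_KEYWORDS.values() for kw in kws)

reviews = [
    "The direction was brilliant and the songs were average.",
    "Watched it yesterday with friends.",  # no target aspect: excluded
]
kept = [r for r in reviews if mentions_target_aspect(r)]
```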
In the next stage, these reviews were given to three proficient annotators for labeling. No specific protocol was followed for the annotation task. The top-ranked aspects of movie reviews were identified as the target aspects and given for annotation. The three annotators are native Malayalam speakers who are capable of identifying the aspects and any other terms synonymous with the targeted aspects. Based on the nature of the comment on the target aspects present in a review, the annotators encode the sentiment of each aspect as 1, 0, and 0.5 for positive, negative, and neutral, respectively. Table 2 shows an example of a movie review in Malayalam (with English translation). After annotating the entire collection of reviews, a total of 7507 aspects were identified in the dataset. Table 3 presents the distribution of aspects across the different sentiment classes, which is illustrated in Fig. 2. As a final step, the correlation between the three annotators was calculated using Spearman's correlation and the Fleiss Kappa Index. These two measures quantify how strong the agreement between the three annotators has been in labeling the dataset. Eq. (1) has been used to calculate Spearman's correlation [3]:

ρ = 1 - 6Σd² / (n(n² - 1))        (1)
where ρ is the Spearman's correlation coefficient, d is the difference between the two annotators' ranks for an aspect's sentiment, and n is the number of annotations. The annotation correlation was also evaluated using the Fleiss Kappa Index via Eq. (2) [4]:

κ = (P - P_e) / (1 - P_e)        (2)
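A minimal implementation of Eq. (1) is sketched below. It assumes the simplified no-ties form of Spearman's correlation; with MABSA's 1/0/0.5 labels tied ranks are common, in which case a rank-averaging variant (as used by standard statistics libraries) would be needed.

```python
def spearman_rho(x, y):
    """Spearman's rho via Eq. (1): rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Valid when there are no tied values in x or y.
    """
    n = len(x)
    # Convert each sequence to 1-based ranks.
    rx = [0] * n
    ry = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: x[i])):
        rx[i] = r + 1
    for r, i in enumerate(sorted(range(n), key=lambda i: y[i])):
        ry[i] = r + 1
    # d is the per-item difference between the two rankings.
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Identical orderings give perfect positive correlation.
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```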
where κ is the Fleiss Kappa value, P is the observed proportion of agreement between a pair of annotators, and P_e is the expected proportion of agreement between the annotators. Table 3 summarizes the results of the two correlation analyses. The correlation analysis of annotator agreement exhibits a high correlation between the three annotators; hence the created dataset is a gold standard for Malayalam ABSA on movie reviews. The labels of annotator 1, one of the annotator combinations exhibiting the highest correlation, were chosen as the final labels for MABSA. Finally, this labeled dataset in Excel format was converted to CSV format for easy usage (Table 4).
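Eq. (2) can be computed from a per-item count matrix. The sketch below follows the standard Fleiss Kappa computation (per-item agreement averaged into P, chance agreement P_e from overall category proportions), not the authors' exact script, and the count matrices are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss Kappa via Eq. (2): kappa = (P - P_e) / (1 - P_e).

    counts: one row per annotated aspect; each row gives how many of the
    annotators chose each sentiment category (positive, negative, neutral).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement P: average of the per-item agreement proportions.
    P = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Expected agreement P_e from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    grand = n_items * n_raters
    P_e = sum((t / grand) ** 2 for t in totals)
    return (P - P_e) / (1 - P_e)

# Three annotators agreeing on every aspect gives kappa = 1.0.
print(fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]]))  # 1.0
```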

Ethics Statements
Data collected was completely anonymous, and the data redistribution policies of social media platforms have been complied with.