Knowledge Extraction of Gojek Application Review Using Aspect-based Sentiment Analysis

Abstract


A. Introduction
The rising number of social media users has had an enormous impact on several industries, especially on user decision-making. Sharing experiences on social media platforms encourages the exchange of information and empowers users to make informed decisions. This heightened word-of-mouth can establish markets and enhance brand equity and a company's financial success. Gojek is a mobile application available through the Google Play Store and the App Store. As a leading firm in Indonesia's creative economy industry, Gojek became the first unicorn company in 2016 and merged with Tokopedia in 2021 to establish GoTo. Gojek has consistently received the Brand Comparison award at the Top Brand Awards over the past five years, showcasing its dedication to quality and resilience [1]. This is plausible because Gojek pays close attention to customer feedback, whether it comes from social media or from comments and ratings of the application. Nevertheless, the large volume of raw review data requires thorough processing before it can be used for sentiment analysis, which aims to comprehend and categorise opinions into several categories based on emotions and feelings.
There are two types of approaches to sentiment analysis: the machine learning approach and the knowledge-based or lexicon-based approach [2], [3]. Sentiment analysis (SA) is a field of study that analyzes people's opinions, judgements, attitudes, and emotions towards entities [2]. Several types of sentiment analysis can be conducted at once to study feedback comprehensively. Saddam & Dewantara conducted SA to examine flood disaster management in Jakarta [4], and M. A. Jassim et al. proposed SA for predicting the ratings of new films [5]. A knowledge base also serves business practices and is primarily used by large organizations, or even individuals, that create and consume distributed knowledge. Knowledge is usually at a higher level of abstraction than a single fact. Evidence on behavior determinants for specific types of behavior and specific social groups can be extracted manually, although collecting and synthesizing all of this knowledge is extremely labor-intensive and challenging [6], [7]. Because it provides heterogeneous information, including both structured and unstructured data with different semantics, a knowledge base can help develop insight into problems that are difficult to uncover [8], [9]. With the knowledge-based approach, we can utilize Knowledge Extraction (KE). KE is the process of extracting information and its relationships, generalizing the information, and storing it in a structured manner in XML or a knowledge base format so that it can be easily accessed and inferred. The extracted knowledge must be in a machine-readable and machine-interpretable format and must represent the knowledge in a way that facilitates inference. KE can use information extraction techniques, which aim to extract (explicit) information of certain categories from a collection of documents [6], [10]. Since KE aims to find entities, relations, and events involving those entities in unstructured data and to link them to existing knowledge bases, KE can be combined with Aspect Based Sentiment Analysis (ABSA).
ABSA is one of the levels of sentiment analysis (SA). It is considered the concept level and focuses on the semantic analysis of text through the use of web ontologies and semantic networks [11], [12]. ABSA has paved the way for novel approaches to better understanding: it processes different aspects, such as the attributes, characteristics, or features of a product or service, which yields a better aspect-aware text representation [13], [14], [15]. ABSA focuses on two tasks: Aspect Term Extraction (ATE) and Aspect Polarity Classification (APC). ATE identifies the different aspects mentioned in a given sentence, which refer to specific characteristics of the product or service discussed in the feedback [16]. This relates to the meaning of KE, namely the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources, which contributes to establishing, improving, and affecting the knowledge that can potentially be applied in SA [6], [7], [17].
Previous studies have applied ABSA in KE. One addresses automatic KE for ABSA in product reviews and introduces an approach for obtaining a knowledge-based system that captures product aspects in a specific domain [18]. Another proposes incorporating multiple lexical knowledge sources into the fine-tuning process of pretrained transformer models for Targeted Aspect-based Financial Sentiment Analysis (TABFSA) [19]. Based on these works, we study KE using the ABSA method because word identification in KE and in ABSA are comparable and related. In this study, we utilize ABSA to construct KE, which will be delivered in XML and as open data so that it can be reused in future research.

B. Research Method
The knowledge extraction of Gojek application reviews in this research is carried out using ABSA. The stages of this research are shown in Figure 1.

Data Collection
The first stage of this research is data collection. Using Google Colaboratory, reviews of the Gojek application on the Google Play Store were collected with Google Scrapper, a Python library. The dataset consists of the 400 most recent reviews written in Bahasa Indonesia.

Data Pre-processing
After collection, the data had to be cleaned of unstructured and duplicated entries. Null values were also removed, and the following steps were applied: normalization, which transforms the data into a standardised format; tokenizing, which breaks sentences down into words; stopword removal, which removes words that have no effect on the classification process; and stemming, which reduces the words in a sentence to their base forms.
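The pre-processing steps described above can be illustrated with a minimal sketch. The slang map, stopword list, and suffix list here are hypothetical and deliberately tiny, and `naive_stem` is only a placeholder for a proper Indonesian stemmer; none of these are the resources used in this study.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use a full
# Indonesian stopword list and a proper stemmer.
SLANG_MAP = {"gk": "tidak", "ga": "tidak", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}
SUFFIXES = ("nya", "kan", "lah")


def normalize(text):
    """Lowercase, drop non-letters, and expand a few informal spellings."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(SLANG_MAP.get(word, word) for word in text.split())


def naive_stem(word):
    """Strip one common suffix; a placeholder for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(reviews):
    """Drop nulls and duplicates, then normalize, tokenize, filter, and stem."""
    seen, cleaned = set(), []
    for review in reviews:
        if not review or not review.strip():
            continue  # null or empty review
        tokens = [naive_stem(token) for token in normalize(review).split()
                  if token not in STOPWORDS]
        key = " ".join(tokens)
        if key and key not in seen:  # skip duplicates after cleaning
            seen.add(key)
            cleaned.append(tokens)
    return cleaned


sample = ["Drivernya ramah dan cepat!", None,
          "Drivernya ramah dan cepat!", "Harga yg mahal"]
cleaned_reviews = preprocess(sample)
print(cleaned_reviews)
```

In this sketch, duplicates are detected after cleaning, so two reviews that differ only in punctuation or casing count as the same entry.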

Data Labelling
In this step, the data are divided into two classes: positive and negative. Ratings of 4 and 5 were labelled positive, and ratings of 1, 2, and 3 were labelled negative. The labelled data are used to train the system to recognise the target patterns and then to test it once training has been carried out.
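The labelling rule can be written as a small function. The function name and the example ratings are illustrative assumptions; only the thresholds come from the text.

```python
def label_from_rating(rating):
    """Map a 1-5 star rating to a sentiment label: 4-5 positive, 1-3 negative."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected rating: {rating!r}")
    return "positive" if rating >= 4 else "negative"


ratings = [5, 3, 1, 4]  # illustrative ratings, not data from the study
labels = [label_from_rating(rating) for rating in ratings]
print(labels)
```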

Define Aspects
Since this method uses ABSA, we have to define the aspects to be analyzed. We define two aspects, Harga (price) and Driver, in Bahasa Indonesia; each contains positive and negative words, as listed in Table 1.
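One way to represent such aspect definitions is a lexicon that maps each aspect to its positive and negative cue words. The sketch below is an assumption about the data structure, and the cue words are hypothetical examples rather than the entries of the aspect table.

```python
# Hypothetical cue words; the actual entries are those defined for the study.
ASPECT_LEXICON = {
    "Harga": {"positive": {"murah", "hemat", "promo"},
              "negative": {"mahal", "naik"}},
    "Driver": {"positive": {"ramah", "sopan", "cepat"},
               "negative": {"kasar", "lambat", "telat"}},
}


def find_aspects(tokens):
    """Count positive and negative cue words per aspect mentioned in a review."""
    hits = {}
    for aspect, cues in ASPECT_LEXICON.items():
        counts = {polarity: sum(token in words for token in tokens)
                  for polarity, words in cues.items()}
        if any(counts.values()):  # keep only aspects that were mentioned
            hits[aspect] = counts
    return hits


aspect_hits = find_aspects(["driver", "kasar", "harga", "mahal", "naik"])
print(aspect_hits)
```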

Word Weighting
After defining the aspects and before starting the analysis, we weight the words using Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is a method that assigns a weight to each word in the data based on its relevance.
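To make the weighting concrete, the following sketch computes the textbook form of TF-IDF, with tf as the term count divided by the document length and idf as log(N / df). This is an assumption about the variant: library implementations such as scikit-learn apply smoothing and normalization, so their values differ. The toy documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights


docs = [["driver", "ramah"], ["driver", "kasar"], ["harga", "mahal"]]
weights = tf_idf(docs)
print(weights[0])
```

A word that occurs in many reviews, such as "driver" here, receives a lower weight than a word that is specific to one review.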

Split Data
For this research, we need to split the data into training data and test data so that the prediction results can be evaluated. The split uses a ratio of 0.2 in scikit-learn, which yields 20% of the data for testing and 80% for training.
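A split of this kind can be sketched with scikit-learn's `train_test_split`. The features and labels below are placeholders that stand in for the TF-IDF matrix and the sentiment labels, and the fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the 400 weighted reviews.
X = [[i] for i in range(400)]
y = ["positive" if i % 2 else "negative" for i in range(400)]

# test_size=0.2 reserves 20% of the data for testing; random_state is an
# assumption that makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 320 80
```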

Utilizing ABSA
At this stage, the split data are modelled and analysed using ABSA. This study uses SVM models with linear, polynomial, and Radial Basis Function (RBF) kernels to analyze the aspects fully.
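As a rough illustration, the three kernels can be fitted in a loop with scikit-learn's `SVC`. This is a sketch under assumptions rather than the program used in the study: the toy reviews for the Driver aspect are invented, and combining `TfidfVectorizer` and `SVC` in a pipeline is one possible arrangement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy reviews for one aspect (Driver); invented for illustration only.
texts = ["driver ramah sopan", "driver cepat ramah", "driver sopan cepat",
         "driver kasar lambat", "driver telat kasar", "driver lambat telat"]
labels = ["positive"] * 3 + ["negative"] * 3

models = {}
for kernel in ("linear", "poly", "rbf"):
    # Each pipeline weights the words with TF-IDF, then fits an SVM.
    model = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
    model.fit(texts, labels)
    models[kernel] = model

predictions = {kernel: str(model.predict(["driver ramah"])[0])
               for kernel, model in models.items()}
print(predictions)
```

Fitting one model per aspect and per kernel makes it straightforward to compare the kernels on the same aspect.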

Cross Validation
Cross validation evaluates model performance by dividing the dataset into smaller subsets. Training and testing are then performed alternately on each subset of the data.
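The procedure can be sketched with scikit-learn's `cross_val_score`. The number of folds is not stated in the text, so `cv=3` is an assumption chosen to suit the invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy reviews; invented for illustration only.
texts = ["driver ramah sopan", "driver cepat ramah", "driver sopan cepat",
         "driver kasar lambat", "driver telat kasar", "driver lambat telat"]
labels = ["positive"] * 3 + ["negative"] * 3

model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# Each of the 3 folds is held out once for testing while the model is
# trained on the remaining folds; cv=3 is an assumption.
scores = cross_val_score(model, texts, labels, cv=3)
print(scores, scores.mean())
```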

Tuning Parameter with GridSearchCV
The model parameters are optimized by searching over combinations of parameter values drawn from a previously specified list and scoring each combination. After reviewing the classification results of the models, we extract the information to uncover new knowledge. KE has emerged as a powerful approach across various fields, facilitating the automatic acquisition and representation of valuable insight from each sample [20]. The process involves classifying the extracted information to ensure its generality, accessibility, readability, and machine interpretability. From the knowledge extracted through ABSA, we can conclude that the sentiment of Gojek application reviews over the past year has decreased, mainly on the driver aspect. The majority of consumer complaints concern the unruly behaviour of Gojek's drivers, followed by price increases. Knowledge should be shared between members of an organization and between organizations. Accordingly, the KE from this analysis can be used by the company when considering its business goals and to support its business planning.
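Returning to the parameter search described at the start of this subsection, a minimal sketch with scikit-learn's `GridSearchCV` might look as follows. The grid values, the fold count, and the toy data are assumptions; the step name `svc` is the one that `make_pipeline` assigns automatically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy reviews; invented for illustration only.
texts = ["driver ramah sopan", "driver cepat ramah", "driver sopan cepat",
         "driver kasar lambat", "driver telat kasar", "driver lambat telat"]
labels = ["positive"] * 3 + ["negative"] * 3

pipeline = make_pipeline(TfidfVectorizer(), SVC())

# Hypothetical grid; every combination is scored with 3-fold cross validation.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```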

Table 3. Tuning parameter code program
The generated KE can be stored in XML format. In recent years, storing knowledge in XML format has gained popularity and has led to interest in storing large data repositories in XML. The flexibility and expressive nature of XML make it possible to organize the knowledge in textual content into hierarchical structures, and XML provides a standard model for storing and transporting data [21].
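As an illustration of how per-aspect results could be serialized hierarchically, the sketch below uses Python's standard `xml.etree.ElementTree` module. The element and attribute names are hypothetical and do not reproduce the schema used in this study, and the counts are placeholders rather than results.

```python
import xml.etree.ElementTree as ET


def knowledge_to_xml(app_name, aspect_summary):
    """Serialize {aspect: {polarity: count}} into a small XML document."""
    root = ET.Element("knowledge", attrib={"application": app_name})
    for aspect, counts in aspect_summary.items():
        node = ET.SubElement(root, "aspect", attrib={"name": aspect})
        for polarity, count in counts.items():
            sentiment = ET.SubElement(node, "sentiment",
                                      attrib={"polarity": polarity})
            sentiment.text = str(count)
    return ET.tostring(root, encoding="unicode")


# Placeholder counts for illustration; not results from the study.
summary = {
    "Harga": {"positive": 2, "negative": 1},
    "Driver": {"positive": 1, "negative": 3},
}
xml_text = knowledge_to_xml("Gojek", summary)
print(xml_text)
```

Because the output is plain XML, it can be parsed back with `ET.fromstring` or with any other XML tool.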
The outcomes of the ABSA, transformed into KE and recorded in XML format, are shown in Figure 4. This research presents information on the outcomes of ABSA on Gojek application reviews. SVM with linear, polynomial, and RBF kernels yielded no significant differences in the analysis of the Harga aspect. In contrast, the Driver aspect showed differences for the linear model in the cross-validation and parameter tuning results. This may be due to a lack of sufficient data during the processing phase. The KE in this research is also stored in XML format, which is expected to be used by companies or other researchers to obtain information on the performance of the Gojek application as reflected in user reviews. Further research is expected to explore alternative methods of analysis to achieve better outcomes.

Figure 4. Knowledge extraction from ABSA Gojek reviews in XML format

Table 1. Aspects for Analysis

Table 2. Cross validation code program

Table 7. Result of cross validation tuned parameter (test dataset)

Table 11. Result of cross validation tuned parameter (test dataset)

Table 12. Result of cross validation tuned parameter (train dataset)