Aspect-based sentiment analysis for hotel reviews using an improved model of long short-term memory

Advances in information technology have given rise to online hotel reservation options. The user review feature is an important factor during the online booking of hotels. Generally, most online hotel booking service providers provide review and rating features for assessing hotels. However, not all service providers provide rating features or recap reviews for every aspect of the hotel services offered. Therefore, we propose a method to summarise reviews based on multiple aspects, including food, room, service, and location. This method uses long short-term memory (LSTM), together with hidden layers and automation of the optimal number of hidden neurons. The F1-measure value of 75.28% for the best model was based on the fact that (i) the size of the first hidden layer is 1,200 neurons with the tanh activation function, and (ii) the size of the second hidden layer is 600 neurons with the ReLU activation function. The proposed model outperforms the baseline model (also known as standard LSTM) by 10.16%. It is anticipated that the model developed through this study can be accessed by users of online hotel booking services to acquire a review recap on more specific aspects of services offered by hotels.
Aspect-based sentiment analysis (ABSA) is a sentiment analysis approach that considers both the entity type and the aspect [1]. ABSA involves aspect term extraction, category detection, and category polarity [2]. Several studies have applied ABSA to Indonesian-language reviews, covering both open-domain and domain-dependent documents. One study that focused on aspect and opinion term extraction used the coupled multilayer attentions (CMLA) mechanism and double embeddings [3], whereas another used the Bidirectional Encoder Representations from Transformers (BERT) transfer learning mechanism [4]. The first study adapted CMLA, proposed by [5], and double-word embeddings using fastText, proposed by [6]. These methods were then tested in combination with several RNN architectures: gated recurrent units (GRU), bi-directional GRU (Bi-GRU), long short-term memory (LSTM), and bi-directional LSTM (Bi-LSTM). The best combination used Bi-LSTM, which achieved F1-measures of 0.918 and 0.9 for aspect term and opinion term extraction, respectively [3]. The second study implemented the BERT-based transfer learning proposed by [7] and attained F1-measures of 0.87 and 0.89 for aspect term extraction and opinion term extraction, respectively [4].
In addition, various ABSA studies for the Indonesian language have addressed reviews of clothing distro [8], restaurants [1], [9]-[11], and marketplaces [12], [13]. The ABSA study of clothing distro reviews implemented the Naive Bayes classifier with bag-of-words as the feature extraction method, attaining 89.86% recall and 97.24% precision [8]. The ABSA study for restaurant reviews proposed by [9] examined aspect term and opinion term extraction. It used the conditional random field (CRF) classifier [14] to predict the aspect term and opinion term, and the proposed model achieved an F1-measure of 0.794. A third study, proposed by [1], concerned aspect term extraction and focused on identifying the best combination of feature extraction methods for that task. The combination of continuous bag-of-words (CBOW) and global vectors (GloVe) was found to be the best for feature extraction, achieving an F1-measure of 0.642 for sentiment polarity. The study by [10] investigated aspect term detection and aspect orientation, using combinations of lexicon and semantic orientation to identify the aspect terms of the review. This model attained F1-measures of 0.8840 and 0.7576 for aspect term extraction and aspect orientation, respectively. A study on combining classical machine learning methods with an appropriate feature extraction method for ABSA was proposed by [12], which achieved an F1-measure of 0.92 using the support vector machine (SVM) model. Furthermore, a study by [11] investigated ABSA with respect to aspect category classification, opinion target extraction, and sentiment polarity extraction. They used a CNN model for aspect category and sentiment polarity classification, while CRF and Bi-LSTM were used for opinion target extraction. The model achieved F1-measures of 0.87, 0.78, and 0.764 for aspect category classification, opinion target extraction, and sentiment polarity extraction, respectively. A study of ABSA with respect to aspect term detection and sentiment classification using a combination of Bi-GRU and GRU was proposed by [13]. The best model identified through this study achieved an F1-measure of 0.9326.
Vol. 8, No. 3, November 2022, pp. 391-403
ABSA studies conducted outside Indonesia include those conducted by Tang [15], Al-Smadi [16], Akhtar [17], Alqaryouti [18], and Liu [19], [20]. Tang's ABSA study focused on fine-grained data using datasets from Amazon and Yelp [15]. He examined a joint aspect-based sentiment topic (JABST) model, which is a combination of topic modelling and ABSA [16], in addition to a MaxEnt-JABST model, which is a combination of max entropy and JABST. In this study, the MaxEnt-JABST model performed better than the JABST baseline model, with an accuracy increase of 5%. Al-Smadi conducted ABSA research on Arabic hotel reviews, focusing on sentence-level granularity [16]. The SVM and RNN methods were investigated in this study. SVM is known for its capacity to perform binary classification, whereas RNNs manage sequential data better than CNNs. The results from this investigation show that the performance of SVM is superior to the other tested models. However, when it comes to training and testing, the performance of the RNN proved to be superior to that of SVM. Akhtar conducted ABSA research and topic modelling on hotel review data from TripAdvisor [17]. ABSA was employed to identify the aspects contained in the review, and subsequently, topic modelling was conducted based on these aspects. Alqaryouti's investigation focused on government application review data [18]. This undertaking examined a combination of integrated lexicon and rule-based ABSA. In terms of quality, the results revealed the superiority of integrated lexicon and rule-based ABSA over other lexicon baselines and rule-based methods. The integrated lexicon can also handle sentiment analysis issues such as implicit and explicit aspects, as well as negation. Liu conducted ABSA research to find a new algorithm or method to outperform existing state-of-the-art methods.
In [19], Liu focused on developing a recurrent memory neural network (ReMemNN). This method was an improvement over MemNN [21] and consists of embedding adjustment learning, multi-element attention modules, and explicit memory modules. The multi-element attention modules were an improvement over the binary attention used in [21]. This research used cross entropy as the objective function and accuracy as the performance metric. The proposed method outperformed state-of-the-art methods on almost all datasets. Liu also conducted other ABSA research [20] to find a method to overcome the weaknesses of RNNs, CNNs, attention methods, and memory networks. This research proposed a gated alternate neural network (GANN). GANN consists of convolution, max-pooling, gate truncation RNN (GTR), and fully connected layers. Convolution and max-pooling are used to divide sentences into sentiment clues and capture local features to overcome the RNN weakness of long-term dependency. GTR can capture denoising informative aspect-dependent sentence clue representation and is used to overcome CNN weaknesses. This method also used a concept of a memory network that viewed sentences not as facts and viewed aspect as a query. This research [20] used four datasets in Chinese and three datasets in English, and the proposed method outperformed state-of-the-art methods in ABSA.
The rest of this paper is structured as follows: section 2 describes the research methodology, section 3 analyses and discusses the research results, and section 4 presents the research conclusions.

Method
The stages of this investigation are displayed in Fig. 1. The process begins with the collection of data from the Traveloka website. These data are pre-processed in preparation for the training of the word2vec model, which is used for feature extraction. The feature extraction results are then used for the embedding of data at the model training stage. The data are separated into training data and validation data. The purpose of the training stage is to train a model for each combination of parameters. The resulting models are then assessed at the model-testing stage to determine the performance of each one. The model-testing stage uses data that are distinct from the data used in the training stage; these data are also obtained from the Traveloka website. A confusion matrix is used to evaluate the output of the testing stage, with the micro-average F1-measure as the performance metric.

Collecting, Labelling, and Splitting Data
Selenium was used to acquire data by crawling reviews on the Traveloka website. Ten randomly selected hotels were the source of 2,700 data elements acquired by crawling. In addition to these data, we included 2,500 data elements used in the study conducted by [22], bringing the research dataset to a total of 5,200 hotel reviews. We split the dataset into two parts, the training-validation dataset and the testing dataset. The training-validation data consisted of 5,000 reviews with a total of 10,283 aspects. The data were labelled according to the aspects identified in the review and the sentiment polarity of each aspect. The aspect labels used in this study were "makanan" (food), "kamar" (room), "layanan" (service), "lokasi" (location), and "lainnya" (miscellaneous), and the aspect sentiment polarities used were positive, neutral, and negative. Each review was required to contain at least one aspect.
The testing data consisted of 200 reviews, which were labelled in the same way as the training-validation data. Table 1 exhibits the data distribution for each aspect. The one-hot encoding method was employed to label the data. Each data label holds 15 binary elements, which represent the combinations of the five aspects with the three sentiment polarities for each aspect.
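As an illustration of the 15-element label described above, the following minimal sketch encodes a review's aspect and polarity annotations as a binary vector. The ordering of aspects and polarities inside the vector, and the function names `encode_label` and `decode_label`, are assumptions made for this example; the study does not specify them.

```python
# Aspect and polarity names come from the text; their ORDER here is an assumption.
ASPECTS = ["makanan", "kamar", "layanan", "lokasi", "lainnya"]
POLARITIES = ["positive", "neutral", "negative"]


def encode_label(annotations):
    """Turn {aspect: polarity} into a 15-element binary vector (5 aspects x 3 polarities)."""
    vector = [0] * (len(ASPECTS) * len(POLARITIES))
    for aspect, polarity in annotations.items():
        index = ASPECTS.index(aspect) * len(POLARITIES) + POLARITIES.index(polarity)
        vector[index] = 1
    return vector


def decode_label(vector):
    """Inverse of encode_label: recover the {aspect: polarity} pairs that are set to 1."""
    result = {}
    for index, flag in enumerate(vector):
        if flag:
            aspect = ASPECTS[index // len(POLARITIES)]
            result[aspect] = POLARITIES[index % len(POLARITIES)]
    return result


# A review that praises the room and criticises the service.
label = encode_label({"kamar": "positive", "layanan": "negative"})
print(label)
print(decode_label(label))
```

Because a review may mention several aspects, more than one element of the vector can be set at the same time.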

Pre-processing
Pre-processing is concerned with cleaning and preparing data for classification [23]. It involves case folding, stop-word removal, stemming, tokenisation, padding, and vectorisation. The case folding process renders all the characters in the data similar in kind, whether in lower case or upper case [24]. Stop-word removal improves machine learning performance through the elimination of conjunctions [25]. Stemming is the process of reducing words into uniform basic stems [25]. Tokenisation implies converting sentences formed from stemming into smaller parts represented by words referred to as tokens [25]. Padding involves levelling the tokenised data length by insertion of dummy tokens behind the original data. Vectorisation renders the data numeric through references to the constructed dictionary. The construction of the dictionary is based on all the words in the data.
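The pre-processing steps can be sketched roughly as follows. This is a simplified illustration rather than the pipeline used in the study: the stop-word list is a hypothetical three-word sample, stemming is omitted because it requires an Indonesian stemmer, tokenisation is done before stop-word removal for convenience, and truncating reviews longer than `max_len` is an assumption.

```python
import re

STOP_WORDS = {"dan", "yang", "di"}  # hypothetical sample, not the list used in the study
PAD = "<pad>"                       # dummy token appended after the original tokens


def preprocess(text, max_len):
    """Case folding, tokenisation, stop-word removal, and padding."""
    text = text.lower()                                   # case folding
    tokens = re.findall(r"[a-z0-9]+", text)               # tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    # Stemming would go here; it is omitted in this sketch.
    tokens = tokens[:max_len]                             # assumption: truncate long reviews
    return tokens + [PAD] * (max_len - len(tokens))       # padding behind the original data


def build_vocab(documents):
    """Build the dictionary from all words in the data; the pad token gets id 0."""
    vocab = {PAD: 0}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorise(tokens, vocab):
    """Replace each token with its dictionary id."""
    return [vocab[token] for token in tokens]


docs = [preprocess("Kamar bersih dan luas", 6), preprocess("Lokasi strategis", 6)]
vocab = build_vocab(docs)
print([vectorise(doc, vocab) for doc in docs])
```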

Training the Word2Vec Model
This study used word2vec, a feature extraction method introduced by [26] that can be trained on a substantial quantity of data in a relatively short time [26]. The word2vec model was configured with the skip-gram architecture and hierarchical softmax [22]. The vector size used in this study was 300, as the size of the word2vec output vector is expected to increase in tandem with the significance of the dataset used [22].
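To make the skip-gram idea concrete, the sketch below generates the (centre word, context word) training pairs that a skip-gram model learns from. It only illustrates how the training pairs are formed and is not a word2vec implementation; the window size is an assumption, since the study does not report it.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every token and its neighbours within the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


# Window of 1 (assumed for the example): each word predicts its direct neighbours.
print(skipgram_pairs(["kamar", "bersih", "luas"], window=1))
```

In skip-gram training, the model is optimised to predict each context word from its centre word, and the learned input weights become the word vectors.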

Training and Testing Process for Aspect and Aspect Polarity Detection
In this stage, a model based on LSTM was constructed to detect aspects and aspect polarity. Both tasks fall within the scope of ABSA. In other words, this research could detect not only the polarity of aspects in a review, but also the various aspects discussed in a review. In general, the implementation of ABSA follows the LSTM-based architectural standards, as illustrated in Fig. 2. Equation (3) is for the forget gate. Equation (4) is for the cell state, and equation (5) is for the candidate gate. Each cell in LSTM has one output from the candidate gate and one hidden state. Equation (6) is used to calculate the result of the hidden state output.
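For reference, a standard formulation of the LSTM cell is sketched below. The notation is the common textbook one and is an assumption here: $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, $x_t$ is the input at step $t$, and $W$ and $b$ are trainable weights and biases. The symbols and their order may differ from those of the numbered equations referred to in the text.

```latex
\begin{align*}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate} \\
h_t &= o_t \odot \tanh\left(C_t\right) && \text{hidden state}
\end{align*}
```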
This study used the LSTM model architecture, together with fully connected layers, as depicted in Fig. 3. The input to the model, generated at the pre-processing stage, is passed through the embedding layer using the embedding weight matrix, which is constructed from each word paired with its word2vec vector. The embedded data are then routed into the LSTM layer. The LSTM layer produces a two-dimensional matrix of the same size as the input embedding matrix. Before entering the fully connected layers, the LSTM output is reduced to one dimension by a flatten layer, which is followed by two fully connected layers whose sizes and activation functions are the parameters tested in this study. As shown in Fig. 3, this study uses the standard LSTM architecture in [28] as the baseline model. The final layer of the baseline model is adjusted to correspond to the size of the data label. During the training stage, we considered several combinations of parameters to determine the best model. Table 2 shows the combinations of parameters considered.
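The layer stack described above can be summarised by tracing output shapes and trainable parameter counts, as in the sketch below. It assumes a sequence length of 50 tokens (not reported in the text) and 300 LSTM units (inferred from the statement that the LSTM output has the same size as the embedded input). The fully connected sizes of 1,200 and 600 and the 15-element output are the best configuration and label size reported elsewhere in this paper. The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def describe(seq_len=50, embed_dim=300, lstm_units=300, fc1=1200, fc2=600, n_labels=15):
    """Return (layer name, output shape, trainable parameters) for each layer after embedding."""
    flat = seq_len * lstm_units  # the flatten layer turns (seq_len, units) into one dimension
    return [
        ("lstm", (seq_len, lstm_units), lstm_params(embed_dim, lstm_units)),
        ("flatten", (flat,), 0),
        ("fc1_tanh", (fc1,), dense_params(flat, fc1)),
        ("fc2_relu", (fc2,), dense_params(fc1, fc2)),
        ("output_sigmoid", (n_labels,), dense_params(fc2, n_labels)),
    ]


for name, shape, params in describe():
    print(f"{name:15s} {str(shape):12s} {params:,}")
```

Under these assumptions, most of the trainable parameters sit in the first fully connected layer, because it is connected to every element of the flattened LSTM output.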

Fig. 3. Proposed Architecture
The testing phase involves a process designed to identify the best combination of parameters for the proposed architecture model. The data used during the testing stage are different from the data used during the training stage. The most useful model is the one with the best combination of parameter values. This study considered four research scenarios to find the best values for the parameters. The running of each scenario is based on the combination of parameters shown in Table 2. The first scenario determines the best size in the first fully connected layer, the second scenario determines the best activation function in the first fully connected layer, the third scenario determines the best size in the second fully connected layer, and the fourth scenario determines the best activation function in the second fully connected layer.

Evaluation
For evaluation, we opted for the micro-average F1-measure because it is highly responsive to the most common classes or labels [29]. The confusion matrix plays a supporting role in the calculation of the micro-average F1-measure. An example of the confusion matrix is provided in Table 3. In an ideal confusion matrix, the entries on the main diagonal are larger than the other entries; such a confusion matrix produces an F1-measure value close to 1. The F1-measure formula is expressed in Equations (7), (8), and (9).
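For reference, the standard definitions of micro-averaged precision, recall, and F1-measure are given below; Equations (7) to (9) are assumed to correspond to them. The index $c$ runs over the classes, and the symbols $P_{\text{micro}}$ and $R_{\text{micro}}$ are notation chosen for this sketch.

```latex
\begin{align*}
P_{\text{micro}} &= \frac{\sum_{c} TP_c}{\sum_{c} \left(TP_c + FP_c\right)} \\
R_{\text{micro}} &= \frac{\sum_{c} TP_c}{\sum_{c} \left(TP_c + FN_c\right)} \\
F1_{\text{micro}} &= \frac{2 \, P_{\text{micro}} \, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
\end{align*}
```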
In these equations, TP represents true positive, FP represents false positive, and FN represents false negative. For each sentiment class, TP is the entry on the main diagonal, FP is the sum of the other entries in the same column, and FN is the sum of the other entries in the same row. Every off-diagonal entry of the confusion matrix therefore counts once as an FP for one class and once as an FN for another, so the summed FP and the summed FN are equal. Based on Equations (7) and (8), since the denominators of precision and recall are then the same, the micro-averages of precision and recall are also the same. This leads to Equation (11), which states that the micro-average F1-measure is equal to the micro-average precision and recall.
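The equality of micro-averaged precision, recall, and F1-measure can be checked numerically. The sketch below uses a made-up 3x3 confusion matrix (rows taken as actual classes and columns as predicted classes, which is an assumption about the layout of Table 3); the numbers are illustrative only and are not results from this study.

```python
def micro_f1(confusion):
    """Micro-averaged precision, recall, and F1 from a square confusion matrix."""
    n = len(confusion)
    tp = fp = fn = 0
    for k in range(n):
        tp += confusion[k][k]                                   # main diagonal
        fp += sum(confusion[r][k] for r in range(n) if r != k)  # rest of column k
        fn += sum(confusion[k][c] for c in range(n) if c != k)  # rest of row k
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative matrix: positive, neutral, negative (hypothetical counts).
example = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
precision, recall, f1 = micro_f1(example)
print(precision, recall, f1)
```

All three values coincide with the share of samples on the main diagonal, which is why a single micro-average F1-measure is sufficient to report.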

Results and Discussion
As mentioned before, we used 200 reviews as testing data, and four scenarios were observed in this study. In the first testing scenario, the performance of the model was grouped based on the size of fully connected layer 1. The combination of parameters used for fully connected layer 1 can be observed in Table 2. The results of this scenario are exhibited in Fig. 4. Fig. 4 shows no clear link between the size of the fully connected layer and the performance of the model. There is an upward movement from 700 to 1,200 neurons, though there are some anomalies in the middle of this upward movement. These anomalies stem from the random initialisation of weights during each iteration. As illustrated in Fig. 4, the model with 1,200 neurons as the size of fully connected layer 1 is the best model, with a micro-average F1-measure of 0.7488. This result is consistent with [30]: a smaller number of neurons in the hidden layer leads to a higher probability of providing incorrect information to the next layer, and vice versa.
In the second scenario, the performance of the model was grouped based on the activation function used on the fully connected layer 1. The activation functions investigated were tanh, ReLU, and sigmoid, which were applied for each combination of parameters. The performance evaluation of this scenario was based on the mean of the micro F1-measure. The results of this scenario can be observed in Fig. 5.
The results showed that the tanh activation function, with a micro-average F1-measure of 0.7462, delivered the best performance. ReLU, with a micro-average F1-measure of 0.7419, had the worst performance. The performance of the sigmoid function was close to that of tanh; both functions have an upper limit value of 1, which facilitates the convergence of the results to the desired range of 0 to 1 [31]. By using tanh, the results did not converge too quickly, and various data patterns could be recognised. In the third research scenario, the performance value of the model was grouped based on the size of fully connected layer 2. Table 2 portrays the size of fully connected layer 2. The results for this scenario were calculated in the same manner as in the other scenarios, i.e., using the micro-average F1-measure for each group. The performance models in this scenario can be observed in Fig. 6.
According to Fig. 6, there is an upward pattern in the model's performance until the size of 400 neurons, beyond which the performance level remained stable. This result is in contrast with the first scenario, in which no specific pattern was discerned. Several anomalies were apparent in the performance of the model. These can be connected to the random initialisation of weights for each iteration. While the upward pattern for this scenario increased until 400 neurons, the result indicated 600 neurons to be the best size for fully connected layer 2. A size of 600 neurons registers a micro-average F1-measure value of 0.7590, which is superior to the micro-average F1-measure of 0.7686 attributed to a size of 400 neurons. These results are in agreement with those realised by [30], which verified that the smaller the neuron size in the hidden layer, the more likely the failure to convey information to the next layer. In the fourth research scenario, the performance of the model was grouped based on the activation function used at fully connected layer 2. As in the first fully connected layer, the tanh, ReLU, and sigmoid activation functions were considered. The results of this scenario are displayed in Fig. 7.  Fig. 7 reveals that ReLU, with a micro-average F1-measure value of 0.758, is the best activation function at fully connected layer 2. The least favourable activation function is the sigmoid function, which results in a micro-average F1-measure of 0.72. The ReLU activation function in this architecture can limit the lower value of the resulting neuron to 0, which corresponds to the desired result range [31].
The best combination of parameters derived from the four scenarios is used to construct the final model. The best parameters are a neuron count of 1,200 for fully connected layer 1 with the tanh activation function and a neuron count of 600 for fully connected layer 2 with the ReLU activation function. The constructed model achieved a micro-average F1-measure of 0.7528 when tested using the same data.
During the testing phase, several misclassifications were detected in the model derived through this process. The first is attributed to the inappropriate use of the aspect and sentiment pairs. For example, "Saat pertama masuk lobby, bau pekat rokok, tapi di kamar, tidak ada bau rokok sama sekali, bersih, luas. Beberapa jam saya turun ke lobby, bau rokok sdhbtdk ada lagi, wangi. Proses check-in, check-out cepat. We'll be back soon. Thanks Epic!" (When you first enter the lobby, there is a thick smell of cigarettes, but in the room, there was no cigarette smell at all, clean, spacious. A few hours later, I went down to the lobby, and there was no smell of cigarettes anymore, fragrant. Check-in, check-out is fast. We'll be back soon. Thanks Epic!). This review should register positive sentiments in the "kamar" (room) aspect and negative sentiments in other aspects. However, neutral sentiments were recognised in the room aspect as "bau" (smells) and "rokok" (cigarettes) in the training data are more often associated with the "kamar" (room) aspect rather than the "lainnya" (miscellaneous) aspect.
The second source of misclassification was the use of a foreign language mixed with the Indonesian language. For example, "Staff-nya ramah banget, breakfast-nya enak" (The staff is very friendly, the breakfast is delicious). This review should garner a positive sentiment for both the "makanan" (food) and "layanan" (service) aspects. However, the model recognised a positive sentiment only for the "layanan" (service) aspect and did not recognise the "makanan" (food) aspect. Mixing a foreign language with the Indonesian language diminished the quality of the pre-processing of data, particularly when it came to stemming and stop-word removal. In the example above, the word "breakfast-nya" cannot be appropriately recognised, as this word cannot be stemmed in Indonesian.
The third source of misclassification was the size of the training dataset, which was relatively small. Due to the lack of training data, the model only recognised limited patterns. The limited training data not only limited pattern recognition but also caused an imbalance in data for each aspect, thereby affecting the model's performance in recognising patterns for certain aspects. Hence, the model was good at recognising one aspect but not the other aspects.
In this study, we also compared our proposed model's results to those of the baseline LSTM architecture. Both architectures were trained and tested with the same 5,000 training and 200 testing data elements. According to the test results exhibited in Table 4, the performance of our proposed model surpassed that of the baseline model. The most significant difference came from the proposed model's two additional fully connected layers before the output layer. The sigmoid activation function was used in the output layer. As demonstrated in Table 4, the addition of fully connected layers enhanced the model's capacity for data pattern identification.

Conclusion
A system for recognising hotel quality based on reviews can be established by obtaining review data, pre-processing the data, using a model for recognising the data, then obtaining the aspects and sentiments of the reviews. Aspect and sentiment data can be processed to be more understandable and appealing to customers. The best model architecture was realised through a combination of LSTM and two fully connected layers (fully connected layer 1 with 1,200 neurons using the tanh activation function and fully connected layer 2 with 600 neurons using the ReLU activation function). The proposed model, with a micro-average F1-measure of 0.7528, outperformed the baseline model by 0.1016 (10.16%) in the F1-measure.