Sentiment Analysis of Amazon Product Reviews using Supervised Machine Learning Techniques

ABSTRACT

Today, almost everything is sold online, and buyers can post reviews of products to give feedback. These reviews serve as feedback for businesses on buyer opinion, product performance, product quality, and seller service. This project focuses on buyer opinions expressed in mobile phone reviews. Sentiment analysis is the task of analyzing such data to obtain opinions about products and services and to classify them as positive, negative, or neutral. This insight can help companies improve their products and help potential buyers make the right decisions. Before a classifier is trained on the dataset, the reviews must be preprocessed, for example through POS tagging and the removal of unwanted data such as stop words, verbs, punctuation, and attachments. Many techniques exist for such tasks; in this article, we use a model built on different supervised machine learning techniques.
The work in [8] is a Flipkart feedback study using the Naïve Bayes and Decision Tree algorithms. It uses the product ratings and reviews in a single dataset of Flipkart sellers and classifies them by subjectivity and objectivity and by the negative or positive polarity of the buyer's wording. These assessments were, to a certain degree, useful both for purchasers and for providers. It is an observational study analyzing how effectively the semantic meaning of product reviews can be categorized.
In [9], feedback from numerous e-shopping websites is evaluated. The primary goal of the framework is to analyze ratings for online shopping sites. The ratings are categorized as positive, negative, or neutral. Such findings help users pick an e-shopping website based on the most favorable reviews and scores. First, a collection of ratings relevant to the services of individual e-shopping websites is gathered. Then, preprocessing methods are applied to the dataset to delete unwanted items and organize the details correctly. After that, a POS tagger assigns a tag to each word according to its role. The SentiWordNet dictionary is used to find the score of each word. Sentiments are then graded as positive, negative, or neutral. Finally, the providers are compared graphically based on positive and negative feedback.
This paper aims to distinguish customers' positive and negative feedback on various products and to develop supervised learning models that polarize large quantities of reviews. Our dataset consists of feedback and ratings taken from user reviews of Amazon products. Based on that, we extracted features from the dataset and built several supervised models. These models use supervised machine learning algorithms: Naive Bayes, logistic regression, support vector machines, ensemble classification, decision tree, and K-nearest neighbor. Finally, we compare all the models and check each model's accuracy together with its ROC curve, recall, and precision.

A. Data Preprocessing
We take the dataset from reviews of Amazon products [3]. The dataset contains 483148 reviews in total. Each record includes the product name, brand, price, rating, review text, and the votes cast on the review. We work primarily with the review column, as it is the most critical aspect of this project. We separate positive and negative reviews; Figure 1 shows the positive and the negative reviews. Besides a brief overview of the dataset, we plotted the distribution of ratings against the number of reviews and counted the total number of reviews with ratings 5, 4, 3, 2, and 1. The dataset has five classes, with ratings from 1 to 5 stars, and the classes are imbalanced: classes 2 and 3 have a small amount of data, while class 5 has more than 175000 reviews. Here is an example from our dataset: review text: "I am using this phone, this is amazing"; rating: 5. The rating distribution of the Amazon reviews can be seen in Figure 2. For the research purpose of this project, we filtered the dataset down to 16000 reviews and then separated them again based on each review's rating.
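The counting and separation steps described above can be sketched as follows. This is only an illustration on a few made-up records: the field names `text` and `rating` and, in particular, the rule that 4-5 stars count as positive and 1-2 stars as negative (with 3 stars left out) are assumptions, since the text does not state the mapping that was used.

```python
from collections import Counter

# Hypothetical toy records standing in for the Amazon review data.
reviews = [
    {"text": "I am using this phone, this is amazing", "rating": 5},
    {"text": "Battery died after two days", "rating": 1},
    {"text": "Works fine, nothing special", "rating": 3},
    {"text": "Great screen and fast delivery", "rating": 4},
    {"text": "Stopped charging within a week", "rating": 2},
    {"text": "Excellent value for the price", "rating": 5},
]

# Number of reviews per star rating (the quantity shown in a rating distribution).
rating_counts = Counter(r["rating"] for r in reviews)


def to_sentiment(rating):
    """Assumed mapping: 4-5 stars positive, 1-2 stars negative, 3 stars dropped."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None


# Separate the reviews by the assumed sentiment label.
positive = [r["text"] for r in reviews if to_sentiment(r["rating"]) == "positive"]
negative = [r["text"] for r in reviews if to_sentiment(r["rating"]) == "negative"]

for stars in sorted(rating_counts, reverse=True):
    print(stars, rating_counts[stars])
print(len(positive), "positive,", len(negative), "negative")
```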

B. Features
We tried two types of features in our project. The first type is CountVectorizer [10]. To use textual data for predictive modeling, the text must first be parsed to extract its words, a procedure called tokenization. These words must then be encoded as integers or floating-point values to serve as inputs to machine learning algorithms. This procedure is known as feature extraction (or vectorization).
We use the CountVectorizer class from the scikit-learn library to convert a collection of text documents into a matrix of term (token) counts. This makes the text representation more flexible.

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(stop_words="english")
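To illustrate what the vectorizer produces, the following minimal sketch fits it on three made-up sentences (an assumption; the actual input would be the review column) and prints the learned vocabulary and the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews used only for illustration.
docs = [
    "this phone is amazing",
    "this phone is terrible",
    "amazing battery, amazing screen",
]

count_vector = CountVectorizer(stop_words="english")
X_counts = count_vector.fit_transform(docs)  # sparse matrix: documents x terms

# Terms ordered by their column index in the matrix.
vocab = sorted(count_vector.vocabulary_, key=count_vector.vocabulary_.get)
print(vocab)
print(X_counts.toarray())
```

The stop words "this" and "is" are dropped, and the third row counts "amazing" twice, which is exactly the information a count-based representation keeps.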
The other method is TF-IDF [11]. It is a statistical metric that assesses the significance of a word to a document within a collection of documents. It is computed by multiplying two components: the number of times the term appears in a document (term frequency) and the inverse document frequency of the term across the collection.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(stop_words="english")
tfidf_vector.fit(X_train_data)
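As a small illustration of the weighting, the sketch below fits the vectorizer on three made-up training sentences (an assumption) and inspects the learned inverse document frequencies. With its default settings, scikit-learn uses a smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit length, so the numbers differ slightly from the textbook product of the two components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training data used only for illustration.
X_train_data = ["good phone", "bad phone", "good camera good phone"]

tfidf_vector = TfidfVectorizer(stop_words="english")
tfidf_vector.fit(X_train_data)

# Map each term to its learned inverse document frequency.
terms = sorted(tfidf_vector.vocabulary_, key=tfidf_vector.vocabulary_.get)
idf = dict(zip(terms, tfidf_vector.idf_))

# TF-IDF matrix: one row per document, one column per term.
X_tfidf = tfidf_vector.transform(X_train_data)

for term in terms:
    print(term, round(float(idf[term]), 3))
```

The term "phone" occurs in every document and therefore receives the lowest weight, while "bad" and "camera", which occur in one document each, receive the highest.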

C. Classification
This research used six classification methods. The first is Naïve Bayes. The Naïve Bayes classification algorithm applies Bayes' theorem to predict the tag of a text based on prior knowledge of the conditions related to it [12]. It computes the probability of each tag for a given text and then predicts the tag with the highest probability.
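Classification is one of the most frequent tasks for learning methods. In this approach, the features are assumed to be conditionally independent of one another given the class, which is termed the Naïve Bayes assumption. The calculation of Naïve Bayes, in its standard form, is given in (1).

$$P(c \mid x_1, \dots, x_n) = \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)} \tag{1}$$

where $c$ is the class (tag) and $x_1, \dots, x_n$ are the features (terms) of the text.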
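A minimal sketch of this classifier, assuming scikit-learn's `MultinomialNB` on raw term counts and four made-up training sentences (label 1 for positive, 0 for negative), might look as follows. The data and the choice of the multinomial variant are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = positive, 0 = negative.
train_texts = [
    "great phone love it",
    "amazing battery great screen",
    "terrible phone broke quickly",
    "awful battery terrible screen",
]
train_labels = [1, 1, 0, 0]

# Term counts feed a multinomial Naive Bayes model (Laplace smoothing by default).
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(train_texts, train_labels)

nb_preds = nb_model.predict(["great screen", "terrible battery"]).tolist()
nb_probs = nb_model.predict_proba(["great screen"])  # columns follow classes [0, 1]

print(nb_preds)
print(nb_probs)
```

The model picks the tag with the highest posterior probability, so "great screen" is tagged positive because "great" occurs only in positive training texts while "screen" occurs equally often in both classes.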
Second, we used logistic regression to solve the binary classification problem. Logistic regression is a classification technique that passes a weighted combination of the input features through a sigmoid function [13]. The sigmoid function transforms a real number a into a number between 0 and 1.
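The mapping described above can be sketched in a few lines. The weights, bias, and the two example features are hypothetical values chosen for illustration; in practice they are learned from the training data.

```python
import math


def sigmoid(a):
    """Map a real number a to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def predict_proba(features, weights, bias):
    """Pass the weighted combination of the inputs through the sigmoid."""
    a = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(a)


# Hypothetical weights for two term counts: ("amazing", "terrible").
weights = [2.0, -2.0]
bias = 0.0

p_positive = predict_proba([1, 0], weights, bias)  # review mentions "amazing"
p_negative = predict_proba([0, 1], weights, bias)  # review mentions "terrible"

# A probability of at least 0.5 is read as the positive class.
label = 1 if p_positive >= 0.5 else 0
print(p_positive, p_negative, label)
```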
We trained a logistic regression classifier on the CountVectorizer and TF-IDF features to compare the accuracy of its predictions against the ratings. We used the default parameters; the resulting accuracy is shown in the Results section. Logistic regression works with a sigmoid function, so the predicted outcome values range from 0 to 1 (false or true). The visualization of logistic regression can be seen in Figure 3. The equation of logistic regression is given in (2).

$$\sigma(a) = \frac{1}{1 + e^{-a}} \tag{2}$$

where $a$ is the weighted combination of the input features.

Third, the K-nearest neighbor (KNN) method is a non-parametric classification procedure that has been frequently used in recent years. This approach finds the K nearest neighbors of the input data point to make a prediction. The class held by the majority of these neighbors is then assigned. The Euclidean distance between the input and each neighbor measures the similarity between the data points [14].
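The nearest-neighbor vote can be written out directly. The sketch below is a plain Python illustration on made-up two-dimensional points, not the implementation used in the experiments; the value k = 3 is an assumption.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```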
Fourth, the Support Vector Machine (SVM) is a classification technique that makes the most of a small quantity of data [15]. It separates the vectors that belong to a particular group or category from those that do not.
Suppose, for example, that two tags are available, costly and cheap, and that the data has two features, x and y. The classifier should decide, for each coordinate pair (x, y), whether it is costly or cheap. To accomplish this, the SVM finds the boundary that best divides the two groups of points, the so-called decision boundary: everything on one side is classified as costly and everything on the other side as cheap.
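The costly/cheap example can be sketched with scikit-learn's `SVC` and a linear kernel. The six points and the kernel choice are assumptions made for illustration; the sign of the decision function tells on which side of the decision boundary a point lies.

```python
from sklearn.svm import SVC

# Hypothetical 2-D points (x, y) with their tags.
X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
y = ["cheap", "cheap", "cheap", "costly", "costly", "costly"]

# A linear kernel yields a straight-line decision boundary.
svm_model = SVC(kernel="linear")
svm_model.fit(X, y)

new_points = [[2, 2], [7, 9]]
svm_preds = svm_model.predict(new_points).tolist()

# Negative values fall on the "cheap" side, positive values on the "costly" side.
margins = svm_model.decision_function(new_points)

print(svm_preds)
print(margins)
```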
Fifth, ensemble methods create more than one model and then combine them to achieve better results [16]. Ensemble approaches are generally more accurate than a single model [17]. This has been seen in several machine learning competitions, where the winning solutions used ensemble methods. In the popular Netflix competition, the winner used a complex ensemble approach to implement a collaborative filtering algorithm.

The last is the decision tree. The decision tree is an algorithm from the family of supervised machine learning algorithms. It can be used for both classification and regression problems [18]. The objective of the approach is to build a model that predicts the value of a target variable [19]. To solve the problem, the decision tree uses a tree representation in which each leaf corresponds to a class label and the features are represented at the internal nodes of the tree. The related code for the ensemble and the decision tree is as follows.
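The original listings are not available in this text. As a stand-in, a minimal sketch using scikit-learn's `DecisionTreeClassifier` and `VotingClassifier` could look like the following. The choice of base models, the hard-voting scheme, and the toy training sentences are all assumptions, not the configuration used in the experiments.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data: 1 = positive, 0 = negative.
train_texts = [
    "great phone love it",
    "amazing battery great screen",
    "terrible phone broke quickly",
    "awful battery terrible screen",
]
train_labels = [1, 1, 0, 0]
new_texts = ["great screen", "terrible battery"]

# A single decision tree on term counts.
tree_model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
tree_model.fit(train_texts, train_labels)
tree_preds = tree_model.predict(new_texts).tolist()

# An ensemble that combines three models by majority (hard) vote.
voting = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble_model = make_pipeline(CountVectorizer(), voting)
ensemble_model.fit(train_texts, train_labels)
ensemble_preds = ensemble_model.predict(new_texts).tolist()

print(tree_preds)
print(ensemble_preds)
```

Hard voting takes the class predicted by the majority of the base models; soft voting, which averages predicted probabilities, would be an equally plausible choice here.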

D. Evaluation Parameter
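The metrics we use to evaluate our project are accuracy, precision, recall, and F1-score [20].
Precision is the fraction of reviews predicted as positive that are truly positive: the true positives divided by the sum of true positives and false positives, as defined in (3).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

where $TP$ is the number of true positives and $FP$ the number of false positives.
Recall is the fraction of truly positive reviews that are correctly identified: the true positives divided by the sum of true positives and false negatives, as in (4).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

where $TP$ stands for true positives and $FN$ for false negatives. The F1 score combines precision and recall as their harmonic mean, as in (5).

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$

Accuracy measures the system's overall performance: the true positive and true negative reviews divided by the total number of true positive, true negative, false positive, and false negative reviews, as in (6).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

where $TN$ is the number of true negatives.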
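As a worked example of these four metrics, the sketch below computes them from confusion-matrix counts. The counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 80/90, recall is 80/100, and accuracy is 170/200, which shows how a model can have higher precision than recall when it misses more positives than it falsely flags.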

III. Results and Discussion
We divide the dataset of 483148 reviews into a training set (80%) and a testing set (20%). After training the machine learning models, we used the test set to generate predictions and measure accuracy. When the project was completed, we judged it a significant activity that enabled us to reach our goal and gave us much confidence. We have designed machine learning models that help predict the sentiment of user reviews. The system reports the accuracy of the different models, which is quite valuable. The accuracy results are given in Table 1.
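The 80/20 split and the comparison of the two feature types can be sketched as follows, assuming scikit-learn's `train_test_split` and a logistic regression classifier with default parameters. The ten sentences, the random seed, and the stratified split are assumptions for illustration; with so little data the resulting scores are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = [
    "great phone", "amazing battery", "love the screen", "excellent camera",
    "great value", "terrible phone", "awful battery", "hate the screen",
    "poor camera", "terrible value",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% training, 20% testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

scores = {}
vectorizers = [
    ("count", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfVectorizer(stop_words="english")),
]
for name, vectorizer in vectorizers:
    model = make_pipeline(vectorizer, LogisticRegression())
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(len(X_train), len(X_test))
print(scores)
```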
The receiver operating characteristic (ROC) curve is a probability curve that characterizes our binary classifier by plotting the true-positive rate against the false-positive rate. The area under the curve (AUC) is a metric that ranges from 0 to 1; it is the region underneath the ROC curve. The ROC curve of Ensemble Classification using TF-IDF can be seen in Figure 4. That curve is only for Ensemble Classification using TF-IDF features; we performed the same task for every model using both TF-IDF and CountVectorizer. The result of the evaluation can be seen in Table 2.
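To make the ROC and AUC computation concrete, the sketch below uses scikit-learn's `roc_curve` and `roc_auc_score` on four made-up predictions. The labels and scores are illustrative values, not outputs of the models in this study.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False-positive and true-positive rates at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level.
auc = roc_auc_score(y_true, y_score)

print(list(zip(fpr, tpr)))
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. Plotting `fpr` against `tpr` gives the curve itself.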
From Table 2, our model is quite successful, as it reaches 89-90% or even more than 90% accuracy on the test set with different models and techniques, but this does not mean it can consistently produce such accurate results. It can produce some false results and, in exceptional cases, completely false results. Predictions for positive reviews should lie in the range from 0.5 to less than 1, and predictions for negative reviews in the range from 0 to 0.5; however, as the figure below shows, some positive reviews were predicted as negative, and some negative reviews were predicted as positive. These deficiencies need to be resolved in future work. The actual and predicted output can be seen in Figure 5.

We live in a world of technology where artificial intelligence is part of many systems, making them more autonomous and efficient. Nowadays, large ad networks and social or e-commerce businesses operate at a vast scale and use targeted marketing, storing user data in a targeted manner by classifying user reviews as positive or negative with a system like the one we have developed using machine learning models. We also found that combined, or ensemble, machine learning models can produce more accurate and reasonable results than single machine learning models. Finally, we compare all the models to check which has the highest accuracy. Our system is built around a GUI, which can be seen in Figure 6. The comparison results of the classification of all models in the system can be seen in Figure 7.

IV. Conclusion
In conclusion, we used two feature extraction methods, TF-IDF and CountVectorizer, with all the algorithms mentioned in the model section: Naive Bayes, SVM, KNN, Decision Tree, Logistic Regression, and Ensemble Classification. As the results show, we obtained better accuracy on the test set with the following algorithms on both types of features: Multinomial Naive Bayes, Ensemble, SVM, and Logistic Regression. The same approach may be extended to many more classification methods, including neural networks, to determine the best classifier for opinion mining and sentiment analysis. One of the main aspects of this project that remains an open problem is problem extraction from reviews. If this work is done in the future, it will benefit suppliers and companies.