Fraud Feedback Detection using Machine Learning Technique

Abstract: In the era of e-commerce, people purchase most products from online shopping websites, and customers rely on the number of reviews and the star rating of a product to choose the right one. Fraudulent companies pay fraudsters to inflate product ratings, which misleads people into buying products they are not satisfied with. This can be curbed to some extent with the help of machine learning algorithms. The proposed method uses machine learning algorithms such as Random Forest, Artificial Neural Network (ANN), and Support Vector Machine (SVM) to help people differentiate between fraud feedbacks and genuine reviews.


I. INTRODUCTION
Before the pandemic, e-commerce was not so popular among people. People used to buy products and goods from local stores, markets, malls, etc. But when the pandemic started, online shopping became the only easy way for people to fulfill their needs because of the restrictions. During the pandemic, e-commerce became the conventional way for people to buy products. As online shopping grows in popularity, shoppers continue to depend on online reviews for the authenticity of a product. Three out of four people trust online reviews as much as personal recommendations.
As we know, reviews can easily be faked. We have all heard the phrase "Fake it till you make it", and yes, sometimes it can be good advice, but unfortunately some businesses have taken that mantra in a negative way. Companies pay people to write fraud feedbacks even though they have not bought the product. Many people trust these fraud feedbacks and end up with a product that they are not satisfied with. As a result, people are losing trust in reviews.
Filtering out fraud feedbacks helps to maintain the authenticity of a product, which also helps the genuine owners. Using machine learning and deep learning, we can build models that detect fraud feedbacks.

II. LITERATURE SURVEY
2.1 Methods
There are many ways to detect fraud feedbacks. According to literature surveys, many researchers have used supervised or unsupervised machine learning, but because datasets are often unlabeled, it is difficult to remove fraud feedbacks from larger datasets using machine learning techniques. This problem can be addressed using algorithms such as Support Vector Machine, Random Forest, Naive Bayes, and Artificial Neural Network.

Approaches
One feature framework for detecting fraud feedbacks has been evaluated in the consumer electronics domain. The steps followed for detecting fake reviews in the electronics domain are: (1) acquisition of datasets, (2) feature framework definition for fraud feedback, (3) fraud feedback classification, and (4) evaluation and analysis of the results. This framework is composed of two types of features: review-centric features, which relate to the text of the review, and user-centric features, which describe the behavior of a user on the site.
Another work developed a supervised machine learning approach to classify online fraud feedbacks and filter out those reviews using datasets that contain hotel reviews. It also merged features from other research works to develop a feature set that can perform better classification.
Another paper compares the performance of several experiments on the Yelp dataset of restaurant reviews using different classifiers such as KNN, Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression, and Random Forest. In that paper, the Yelp dataset is used to evaluate the proposed approach. Fraud feedbacks in the online restaurant reviews are detected using a hybrid approach that follows data pre-processing, detection, and ensemble learning.

III. PROPOSED METHODOLOGY
Among the three types of machine learning, supervised machine learning is the most effective for fraud feedback detection. Although deep learning methods are also very effective, they require a large amount of training data, so they are not well suited to fraud feedback detection because of the difficulty of labeling data manually. We have chosen Random Forest, SVM, and ANN to obtain the best performance and maximum accuracy. The reason for choosing ensemble modeling is that a single classifier cannot perform best on every dataset.

Classifier Ensemble Approach
Ensemble learning methods are popular and widely used because they reduce bias and achieve better performance. An ensemble strategy combines the predictions of two or more models. An ensemble can make better predictions than any single contributing model, and it reduces the spread or dispersion of the predictions and of model performance. A classifier is built based on the identified features, and those features are assigned to the algorithms depending on the classified training sets. In this section, we briefly discuss the algorithms proposed for fraud feedback detection, including classification, stop word removal, and stemming.
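
As a minimal sketch of how an ensemble combines the predictions of several models, the following Python snippet applies hard (majority) voting. The function name majority_vote and the three prediction lists are hypothetical and only illustrate the idea; the exact combination rule of the proposed system is not specified here.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per review (hard voting)."""
    n_reviews = len(predictions_per_model[0])
    combined = []
    for i in range(n_reviews):
        # Count the label each model assigned to review i.
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers for four reviews.
svm_preds = ["fake", "genuine", "fake", "genuine"]
rf_preds = ["fake", "fake", "genuine", "genuine"]
ann_preds = ["genuine", "fake", "fake", "genuine"]

ensemble_preds = majority_vote([svm_preds, rf_preds, ann_preds])
print(ensemble_preds)  # ['fake', 'fake', 'fake', 'genuine']
```

With an odd number of models and two classes, a tie cannot occur, which is one reason three base classifiers are convenient for hard voting.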

A. Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used effectively for both classification and regression problems. SVM finds a hyperplane in an N-dimensional space that distinctly classifies the data points. A hyperplane is a decision boundary that separates data points belonging to different classes, and the margin is the distance between the hyperplane and the nearest data points of each class. Support vectors are the data points closest to the hyperplane; they determine its position and therefore the margin between the two classes. The hyperplane with the maximum distance to the nearest training data points of both classes is considered the most appropriate hyperplane.
SVM is performed with the help of kernels. Kernels are very helpful in classification and are used to analyze the patterns in a given dataset. For fraud feedback detection, the linear kernel is expected to give better accuracy, as the task falls in the category of text classification problems. The second most commonly used kernel is the Gaussian Radial Basis Function (RBF). An RBF-kernel SVM has two hyper-parameters, cost (C) and gamma, that decide the performance of the model.
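
The quantities discussed above can be sketched in a few lines of Python. The snippet below computes the distance of a point from a hyperplane w.x + b = 0 and evaluates the linear and RBF kernels. The function names and the numeric values are assumptions chosen for illustration; the cost parameter C belongs to the training objective and is not shown.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the dot product of two feature vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))


def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from point x to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(linear_kernel(w, x) + b) / norm_w


# Hypothetical hyperplane and data point.
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
print(distance_to_hyperplane(w, b, x))  # 4.0

# A larger gamma makes the RBF similarity decay faster with distance.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=2.0))
```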

B. Random Forest
Another algorithm used for filtering out fraud feedbacks is Random Forest (RF), which is also a supervised machine learning algorithm used for classification and regression. Selected features are given as input to the RF algorithm, which creates several candidate decision trees, each with a weight and a label. RF is a collection of multiple decision trees and is less sensitive to the training data than a single tree. The main reason to use Random Forest over a single decision tree is that decision trees are highly sensitive to the training data, which can result in high variance. Random Forest relies on two techniques: bootstrapping and random feature selection. The idea behind bootstrapping is to create several subsets of the training set by sampling randomly with replacement. A decision tree is trained independently on each bootstrapped dataset, using a randomly chosen subset of the features. The process in which the results from multiple models are combined is called aggregation. Bootstrapping and aggregation together are called bagging. Bootstrapping ensures that the same data is not used for every tree, so it helps the model to be less sensitive to the original training dataset. Random feature selection helps to reduce the correlation between trees.
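
Bootstrapping, random feature selection, and aggregation can be illustrated with the following simplified Python sketch. To keep it short, each "tree" is a one-level decision stump, so choosing features per tree coincides with choosing them per split; a full Random Forest grows deeper trees. The toy feature vectors, the labels (1 = fraud, 0 = genuine), and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows, feature_ids):
    """Pick the (feature, threshold, label) rule with the fewest training errors."""
    best = None  # (errors, feature, threshold, label_if_above)
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in rows}):
            for label_if_above in (0, 1):
                errors = sum(
                    (label_if_above if x[f] > threshold else 1 - label_if_above) != y
                    for x, y in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, label_if_above)
    _, best_f, best_t, best_label = best
    return lambda x: best_label if x[best_f] > best_t else 1 - best_label


def train_forest(rows, n_trees, n_sub_features, seed=0):
    """Bagging: one stump per bootstrap sample, each on a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        feature_ids = rng.sample(range(n_features), n_sub_features)
        forest.append(train_stump(sample, feature_ids))
    return forest


def predict(forest, x):
    """Aggregation: majority vote over all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical numeric features per review; label 1 = fraud, 0 = genuine.
rows = [
    ([5, 6, 7], 1), ([6, 5, 8], 1), ([7, 7, 6], 1),
    ([0, 1, 2], 0), ([1, 0, 1], 0), ([2, 2, 0], 0),
]
forest = train_forest(rows, n_trees=15, n_sub_features=2)
print(predict(forest, [9, 9, 9]), predict(forest, [0, 0, 0]))
```

An odd number of trees avoids ties in the vote for a two-class problem.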

C. Artificial Neural Network
An Artificial Neural Network (ANN), often simply called a neural network, mimics the behavior of the human brain to solve complex data-driven problems. An ANN consists of three layers, viz. an input layer, a hidden layer, and an output layer, as shown in Fig. 1. The feature vectors are provided as input to the input layer. The hidden layer, which consists of neurons, connects the input layer and the output layer; all calculations to find hidden features and patterns are performed in this layer. The output layer gives the response of the neural network. In Fig. 1, the arrows drawn from the input to the hidden layer and from the hidden to the output layer indicate the pathway for the flow of information. Each arrow represents a connection between two neurons. Each connection has a weight, a real-valued number that controls the signal between the two neurons. If the model gives poor or undesired output, it alters the weights to improve the output. However, if the model generates good or desired output, there is no need to adjust the weights between connections.
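
The forward flow of information and the weight adjustment described above can be sketched with a tiny network in plain Python. This is only an illustration under stated assumptions: one hidden layer of three sigmoid neurons, squared-error loss, per-sample gradient descent, and a hypothetical toy dataset with two binary features in which a review is labeled fraud (1) only when both features are set.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_inputs, n_hidden, seed=0):
    """Small random real-valued weights; biases start at zero."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def forward(x, p):
    """Input layer -> hidden layer -> output layer."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(p["w_hidden"], p["b_hidden"])
    ]
    output = sigmoid(sum(w * h for w, h in zip(p["w_out"], hidden)) + p["b_out"])
    return hidden, output


def mean_squared_error(data, p):
    return sum((forward(x, p)[1] - y) ** 2 for x, y in data) / len(data)


def train(data, p, epochs=2000, lr=0.5):
    """Adjust the weights in place whenever the output deviates from the label."""
    for _ in range(epochs):
        for x, y in data:
            hidden, out = forward(x, p)
            delta_out = (out - y) * out * (1.0 - out)
            for j, h in enumerate(hidden):
                # Error signal of hidden neuron j, using the weight before its update.
                delta_h = delta_out * p["w_out"][j] * h * (1.0 - h)
                p["w_out"][j] -= lr * delta_out * h
                for i, xi in enumerate(x):
                    p["w_hidden"][j][i] -= lr * delta_h * xi
                p["b_hidden"][j] -= lr * delta_h
            p["b_out"] -= lr * delta_out


# Hypothetical toy data: (feature vector, label).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
params = init_params(n_inputs=2, n_hidden=3)
loss_before = mean_squared_error(data, params)
train(data, params)
loss_after = mean_squared_error(data, params)
print(loss_before, loss_after)
```

The update is proportional to the output error, so when the network already produces the desired output the weights barely change, which matches the behavior described in the text.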

Proposed Architecture
First, data acquisition takes place, i.e., the reviews submitted by users to the website are stored in the data set. That data set then goes through pre-processing, which consists of lexical analysis, stop word removal, and stemming. In lexical analysis, the review is separated into tokens using delimiters such as space, newline, and tab. In the stop word removal process, words that occur commonly and frequently, such as a, an, the, and of, are ignored. In the stemming process, each word is reduced to its word stem. In short, the system replaces all the variants of a word with a single stem. These variants include plurals, gerund forms (the 'ing' form), third-person suffixes like 's' and 'es', and past tense suffixes like 'ed'.
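
The three pre-processing steps can be sketched as follows. The stop word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins; a production system would more likely use a fuller stop word list and an established stemmer such as the Porter algorithm.

```python
# Tiny illustrative stop word list (an assumption, not the list used by the system).
STOP_WORDS = {"a", "an", "the", "of", "is", "and", "it", "this", "to"}
# Suffixes named in the text, checked in this order.
SUFFIXES = ("ing", "es", "ed", "s")


def tokenize(review):
    """Lexical analysis: lowercase, then split on spaces, tabs, and newlines."""
    return review.lower().split()


def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]


def naive_stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    return [naive_stem(tok) for tok in remove_stop_words(tokenize(review))]


print(preprocess("The charger stopped working and the boxes arrived damaged"))
# ['charger', 'stopp', 'work', 'box', 'arriv', 'damag']
```

The stems need not be dictionary words; what matters is that variants such as "working" and "works" map to the same token.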
The next implementation step is normalization, where a count vectorizer is applied to the pre-processed data, followed by cross-validation. Subsequently, the data is split into training and testing data sets. Afterward, feature extraction, the process in which the number of features is reduced by creating new features from existing ones, is applied to both the training and testing data sets. These extracted features have both positive and negative values. With the help of a hybrid approach using the supervised machine learning algorithms Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF), all the reviews are classified as genuine or deceptive. Lastly, we analyze the accuracy of the proposed system and compare the highest accuracy obtained with that of other existing systems.
The major drawback of existing systems is the low accuracy obtained by their models. To increase the accuracy of fraud feedback classification, we propose a sentiment classification approach based on product aspects. Its modules are a Preprocessing Module, Product Aspect Identification, and Sentiment Classification. The proposed system describes personalized fraud feedback detection from social media. In the initial research, we evaluate the system's performance on this dataset and measure its accuracy. With the proposed system, more accurate results can be generated, which will directly make the process easier and less straining for e-commerce websites. Customers can also trust the authenticity of the reviews on such a website without hesitation.
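
The count vectorization, cross-validation split, and accuracy measurement mentioned in this section can be sketched in plain Python as shown below. All function names and the toy token lists are hypothetical; an actual implementation would more likely rely on a machine learning library for these steps.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def count_vectorize(tokens, vocabulary):
    """Turn a token list into a vector of per-token counts."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocabulary]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical pre-processed reviews.
docs = [["great", "product", "great"], ["bad", "product"]]
vocabulary = build_vocabulary(docs)
vectors = [count_vectorize(tokens, vocabulary) for tokens in docs]
print(vocabulary)  # {'bad': 0, 'great': 1, 'product': 2}
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
print(list(k_fold_indices(5, 2)))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```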

Figure 1: Multilayered artificial neural network [9]

Figure 2: Architecture Diagram of Online Fake Review Detection System