Comparative Analysis of Online Rating Systems

Online rating systems serve as decision support tools for choosing the right transactions on the internet. Consumers usually rely on others' experiences when they transact online, so feedback from other users helps such transactions succeed. One important form of this feedback is product ratings. Many online rating systems have been proposed by researchers and by industry, but there is much debate about their accuracy and stability. This paper examines the accuracy and stability of a set of common online rating systems over dense and sparse datasets. To that end, we used three evaluation measures, namely Mean Absolute Error (MAE), Mean Balanced Relative Error (MBRE) and Mean Inverse Balanced Relative Error (MIBRE), in addition to the Borda count to assess the stability of ranking among the rating systems. The results showed that the median and Dirichlet models are the most accurate for both sparse and dense datasets, whereas the BetaDR model is the most stable across the evaluation measures. We therefore recommend using Dirichlet or BetaDR for products with few ratings and the median model for products with a large number of ratings.

Keywords: Online rating systems; reputation models; comparative analysis; decision making; e-commerce


I. INTRODUCTION
Online rating systems play a vital role in most e-commerce applications. They help users make decisions when they perform internet transactions [1], [4]. An online rating system is responsible for collecting, processing and aggregating the ratings given to a specific product. The main challenge facing online rating systems is how to aggregate the collected ratings for a product in a way that reflects its real quality [13]. In practice, most well-known e-commerce portals, such as eBay and Amazon, use their own methods to compute product quality, while some other portals use the simplest aggregation methods, the naïve averages (i.e. mean, median and mode). In contrast, many authors have proposed methods that compute the product score with statistical and machine learning techniques. The accuracy of such methods depends mainly on user satisfaction with the results [14]. This satisfaction is difficult to measure because most e-commerce applications do not provide a tool for evaluating it, or for checking whether the aggregate rating helped users complete a successful transaction.

The rating aggregation methods in the literature can be divided into four groups: naïve models, weighted average models, fuzzy models and probabilistic models. Weighted average models are the most widely used among researchers; their weights are derived from historical user data or from a time factor. These weights work as discount factors that reflect different aspects of users' behavior, such as their reliability, trustworthiness and credibility in providing ratings. One common problem facing rating systems is unfair ratings, which bias the aggregate scores of some products.

This paper examines the accuracy and stability of the most common online rating systems over dense and sparse datasets. In practice, not all methods perform well over both dense and sparse datasets. Almost all previous studies confirm this, because each model attempts to address a specific limitation of earlier rating systems. To the best of our knowledge, no systematic procedure has been conducted to compare and evaluate different online rating systems in terms of accuracy and stability. The research questions are:
RQ1: Is there any one method that can perform stably well under all conditions?
RQ2: Which group of methods is more appropriate for dense datasets?
RQ3: Which group of methods is more appropriate for sparse datasets?

This paper is structured as follows: Section 2 presents the literature and an overview of existing online rating systems. Section 3 introduces the experimental methodology and comparison procedure. Section 4 presents the obtained results, and Section 5 gives the conclusions.

II. OVERVIEW
An online rating system receives ratings from users as input and computes the aggregate score of a product. Given a set of users $U = \{u_i\}$, where each user has rated at least one product, and a set of products $P = \{p_j\}$, where each product has received at least one rating, the intersection between user $u_i$ and product $p_j$ is the rating $r_{ij}$ such that $r_{ij} \in \{1, \ldots, k\}$, where $k$ is the maximum rating level of the rating system. $\bar{r}_j$ is the average of the ratings of product $p_j$, and $\bar{r}$ is the average of all ratings in the dataset. Naïve methods such as the arithmetic mean (see Equation 1) and the median are the most commonly used methods. Garcin et al. [15] compared naïve methods with other rating systems and found that the median is the most accurate method. In contrast, other studies [8], [9] showed that naïve methods are ineffective because they are easily influenced by unfair and malicious ratings and cannot detect trends emerging from recent ratings.
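IMDb is another well-known online rating system; it uses a true Bayesian estimate to calculate the aggregate product score, as shown in (2). The exact implementation of this model is still unpublished in order to keep the policy effective.
www.ijacsa.thesai.org

$$\bar{r}_j = \frac{1}{n} \sum_{i=1}^{n} r_{ij} \qquad (1)$$

$$\mathrm{score}_j = \frac{n}{n + MinR}\,\bar{r}_j + \frac{MinR}{n + MinR}\,\bar{r} \qquad (2)$$

Where $n$ is the number of ratings received for product $p_j$, $r_{ij}$ is the $i$-th of those ratings, $\bar{r}_j$ is the average rating of $p_j$, $\bar{r}$ is the average of all ratings in the dataset, and $MinR$ is the minimum number of ratings required to appear in the top 250. IMDb usually uses MinR=2500.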
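The naïve aggregates and the IMDb-style weighted rating discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function name `imdb_weighted_rating`, the sample ratings and the small `min_r` in the usage lines are hypothetical, and the formula assumes the commonly quoted weighted-rating form in which a product's mean is shrunk toward the global mean.

```python
from statistics import mean, median


def imdb_weighted_rating(product_ratings, global_mean, min_r=2500):
    """Shrink a product's mean rating toward the global mean.

    Assumes the commonly quoted weighted-rating form; the default
    min_r=2500 is the MinR value stated in the text.
    """
    n = len(product_ratings)
    product_mean = mean(product_ratings)
    weight = n / (n + min_r)  # share of trust placed in the product's own mean
    return weight * product_mean + (1 - weight) * global_mean


ratings = [5, 5, 4, 1]        # hypothetical ratings for one product
naive_mean = mean(ratings)
naive_median = median(ratings)
# A small min_r is used here only so the shrinkage is visible on four ratings.
shrunk = imdb_weighted_rating(ratings, global_mean=3.0, min_r=4)
```

With few ratings relative to `min_r`, the weighted score stays close to the global mean; as the number of ratings grows, it approaches the product's own mean.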
In the literature, weighted average models are the most widely used, with weights computed from either time or user data. Jøsang and Haller [5] introduced the age of a rating as a discount factor in aggregation, where old ratings receive less weight than recent ratings because they are less informative. The main problem with this model is deciding which time unit (i.e. day, week, month, year) the function should use. Another discount function uses the number of past transactions instead of the age of the ratings [10]. Leberknight et al. [8] stated that naïve methods are good when ratings show a clear trend over time, but when they do not, the volatility of ratings should be used as a discount factor to compute the product score. They proposed a discount function based on rating volatility, but they ignored other factors such as the trustworthiness and credibility of users.

On the other hand, many online rating systems use user data to measure users' reliability, credibility and trustworthiness and reflect these as weights during aggregation [12]. In this direction, Riggs et al. [11] defined the reliability of a user as his ability to provide ratings that are very close to the current average rating. They defined a measure of that closeness and used it in their weighted average model. Lauw et al. [7] studied the leniency of users when they rate products. They proposed a function that calculates the leniency or strictness of a user and reflects it as a weight. They classified users into two classes (lenient and strict) based on the value of the leniency variable shown in Equation 3: depending on that value, a reviewer is considered either strict or lenient. This model is called LQ.

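$$L_i = \frac{1}{|R_i|} \sum_{j \in R_i} \frac{r_{ij} - q_j}{r_{ij}} \qquad (3)$$

$$q_j = \frac{1}{|R_j|} \sum_{i \in R_j} r_{ij}\,(1 - \alpha L_i) \qquad (4)$$

Where $R_i$ is the set of items rated by reviewer $i$, $R_j$ is the set of reviewers who rated item $j$, and $q_j$ is the initial quality of item $j$, which is usually the average of its ratings.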
$L_i$ is the leniency of the reviewer, and $\alpha$ is a compensation factor determined by an expert.

Abdel-Hafez et al. [1], [16] used the Beta distribution function to compute rating weights. Their model is called BetaDR. The product ratings are first sorted from smallest to largest and scaled as shown in (5). The Beta distribution has the advantage that its shape can change with the rating distribution; they therefore controlled the shape of the function with two variables, $\alpha$ and $\beta$, as shown in (6). Finally, the product score is computed as shown in (7).

$$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} \qquad (6)$$

$$\mathrm{score} = \sum_{l=1}^{k} l \cdot LW_l \qquad (7)$$

Where $\Gamma$ is the gamma function, $\alpha$ and $\beta$ are Beta distribution parameters that are determined from the mean and distribution of the ratings, $l$ is the rating level (i.e. 1, 2, …, k), and $LW_l$ is the summation of the normalized Beta weights for the target level.

Jøsang et al. [6] introduced a reputation model based on the Dirichlet probability distribution, as shown in Equations 8 and 9. This model generalizes their previous model and takes the rating counts into account. It achieves good accuracy over sparse datasets because it involves factors that handle uncertainty in the data.
where $\vec{S}_y$ represents the score vector over the rating levels, $S_y(i)$ represents the probability that an agent gives rating $i$ to agent $y$, $C$ is a constant value, $a(i)$ is the base rate, which equals $1/k$, and $R_y(i)$ is the number of ratings at level $i$.

Bharadwaj et al. [2] used the ordered weighted averaging method with fuzzy computation as part of their trust model to aggregate ratings, as shown in Equations 10 to 12. According to them, the reputation of a reviewer is defined as the accuracy of his predictions of other reviewers' ratings of different items. Recently, Liu et al. [9] proposed several factors to identify unfair ratings. These factors are combined using a fuzzy logic system based on predefined human rules. The output of the fuzzy logic system is the discount weight of a rating.
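To make the Dirichlet-based score described above more concrete, the following Python sketch computes a score vector from per-level rating counts using the quantities named in the text (rating counts, a constant $C$ and a uniform base rate $1/k$). It assumes the widely cited multinomial form $S(i) = (R(i) + C\,a(i)) / (C + \sum_j R(j))$; the function names, the default `C=2.0` and the use of the expected rating level as a single-number score are illustrative assumptions, not details confirmed by this paper.

```python
def dirichlet_score_vector(counts, C=2.0):
    """Return the expected probability of each rating level.

    counts[i] is the number of ratings at level i+1; the base rate is
    assumed uniform (1/k). C=2.0 is an illustrative choice of constant.
    """
    k = len(counts)
    base_rate = 1.0 / k
    total = sum(counts)
    return [(r + C * base_rate) / (C + total) for r in counts]


def expected_level(scores):
    """Collapse a score vector into one number: the expected rating level.

    This point estimate is an assumption made for illustration.
    """
    return sum(level * s for level, s in enumerate(scores, start=1))


no_evidence = dirichlet_score_vector([0, 0, 0, 0, 0])
ten_fives = dirichlet_score_vector([0, 0, 0, 0, 10])
```

With no ratings the vector equals the base rate, so the expected level sits at the middle of the scale; as counts accumulate the prior term $C\,a(i)$ matters less, which is one way to see why such a model behaves sensibly on sparse data.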

III. EXPERIMENTAL METHODOLOGY

A. Datasets
Most authors used public datasets to validate their models, which allows them to generalize the extracted knowledge. In this paper we follow that approach to facilitate future replication studies. Two stable versions of the MovieLens dataset have been used, namely 100K and 1M [3]. Both datasets have a large number of ratings and are considered dense, as shown in Table 1. To compare online rating systems over sparse datasets, we extracted three new datasets from the original 1M dataset, containing 4, 6 and 8 randomly selected user ratings, respectively. These datasets are called 1M4, 1M6 and 1M8. The characteristics of all datasets are shown in Table 1.
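One way to derive such sparse subsets is sketched below in Python. The sketch assumes that the 4, 6 or 8 ratings are drawn per product and that products with fewer ratings than requested are dropped; neither detail is stated in the text. The function name `sample_sparse` and the `(user_id, product_id, rating)` tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def sample_sparse(ratings, per_product, seed=0):
    """Keep `per_product` randomly chosen ratings for each product.

    ratings: iterable of (user_id, product_id, rating) tuples.
    Products with fewer than `per_product` ratings are skipped
    (an assumption made for this sketch).
    """
    rng = random.Random(seed)          # fixed seed for reproducible subsets
    by_product = defaultdict(list)
    for row in ratings:
        by_product[row[1]].append(row)

    sparse = []
    for rows in by_product.values():
        if len(rows) >= per_product:
            sparse.extend(rng.sample(rows, per_product))  # without replacement
    return sparse


# Hypothetical toy data: product "A" has 10 ratings, product "B" has 3.
rows = [(u, "A", 3) for u in range(10)] + [(u, "B", 5) for u in range(3)]
subset = sample_sparse(rows, per_product=4, seed=1)
```

Calling the function with `per_product` set to 4, 6 and 8 on the 1M ratings would yield subsets analogous to 1M4, 1M6 and 1M8 under these assumptions.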

B. Evaluation measures
Evaluation measures are used to assess the accuracy and stability of online rating systems. To measure the accuracy of a model we used three measures: Mean Absolute Error (MAE), Mean Balanced Relative Error (MBRE) and Mean Inverse Balanced Relative Error (MIBRE). These measures were selected because they are not biased. For each product, the MAE assesses the closeness of the generated score to the actual ratings of that product, as shown in Equation 13. MBRE and MIBRE compute the relative accuracy of the generated scores, as shown in Equations 14 and 15.

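$$MAE = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n_j} \sum_{i=1}^{n_j} \left| r_{ij} - \hat{r}_j \right| \qquad (13)$$

$$MBRE = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n_j} \sum_{i=1}^{n_j} \frac{\left| r_{ij} - \hat{r}_j \right|}{\min(r_{ij}, \hat{r}_j)} \qquad (14)$$

$$MIBRE = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n_j} \sum_{i=1}^{n_j} \frac{\left| r_{ij} - \hat{r}_j \right|}{\max(r_{ij}, \hat{r}_j)} \qquad (15)$$

Where $\hat{r}_j$ is the aggregated score for product $p_j$, $r_{ij}$ is a test rating of $p_j$, $n_j$ is the number of test ratings of $p_j$, and $m$ is the number of products in the testing data.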
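The three accuracy measures can be sketched in Python as follows. The sketch assumes the usual definitions: absolute error for MAE, absolute error divided by the smaller of rating and score for MBRE, and divided by the larger of the two for MIBRE. It also assumes that errors are averaged within each product first and then across products. The function names and the dictionary layout are hypothetical.

```python
def _mean_error(test_ratings, scores, error):
    """Average `error(rating, score)` within each product, then across products."""
    per_product = []
    for product, ratings in test_ratings.items():
        score = scores[product]
        per_product.append(sum(error(r, score) for r in ratings) / len(ratings))
    return sum(per_product) / len(per_product)


def mae(test_ratings, scores):
    return _mean_error(test_ratings, scores, lambda r, s: abs(r - s))


def mbre(test_ratings, scores):
    # Balanced relative error: divide by the smaller of rating and score.
    return _mean_error(test_ratings, scores, lambda r, s: abs(r - s) / min(r, s))


def mibre(test_ratings, scores):
    # Inverse balanced relative error: divide by the larger of the two.
    return _mean_error(test_ratings, scores, lambda r, s: abs(r - s) / max(r, s))


# Hypothetical test ratings and aggregated scores for two products.
test_ratings = {"a": [4, 2], "b": [5]}
scores = {"a": 4.0, "b": 4.0}
```

Because ratings start at 1 and aggregated scores are positive, the denominators in `mbre` and `mibre` are never zero for data of this kind.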

C. Experimental procedure
As mentioned in the literature review, many models have been proposed to aggregate online ratings. In this study we used eight state-of-the-art models: Mean, Median, BetaDR [1], Bayesian [6], Dirichlet [5], IMDb, Fuzzy rating [2] and LQ [7]. For comparison we used 10-fold cross validation. This procedure divides the dataset into 10 groups of training and testing data, where each group uses 90% of the data for training and 10% for testing. The training data is used to build the online rating system, while the testing data is used to evaluate it. The validation runs 10 times, once for each group, and in each run we record the MAE, MBRE and MIBRE for the test ratings. The fundamental idea behind this validation technique is that a reputation score produced from the training data is considered accurate if it is very close to the actual ratings in the testing data.

To measure the stability of each model across the evaluation measures, we rank all models according to their accuracy in terms of MAE, MBRE and MIBRE over all datasets. Then we run the Borda count method over all datasets, dense datasets and sparse datasets respectively. Borda count is a ranked voting method that orders candidates based on the ranks provided by voters. The method is simple and very common in decision making. First we evaluate the stability of all models over all datasets across all evaluation measures. In the second round we evaluate the stability over the dense datasets only, and finally over the sparse datasets. In all cases the evaluation measures act as voters.
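The ranking step can be illustrated with a small Python sketch of the Borda count, in which each evaluation measure acts as a voter that orders the models from lowest to highest error. The scoring convention (n-1 points for first place down to 0 for last), the alphabetical tie-break and the error values below are assumptions chosen for illustration; they are not results from this study.

```python
def rank_by_error(errors):
    """Order model names from lowest to highest error (ties broken by name)."""
    return sorted(errors, key=lambda model: (errors[model], model))


def borda_count(rankings):
    """Aggregate best-to-worst rankings with the Borda count.

    Each ranking awards n-1 points to its first candidate, n-2 to the
    second, and so on down to 0. Returns (final order, points per candidate).
    """
    n = len(rankings[0])
    points = {candidate: 0 for candidate in rankings[0]}
    for ranking in rankings:
        for position, candidate in enumerate(ranking):
            points[candidate] += n - 1 - position
    order = sorted(points, key=lambda c: (-points[c], c))
    return order, points


# Hypothetical error values for three placeholder models under three measures.
voters = [
    rank_by_error({"A": 0.70, "B": 0.72, "C": 0.75}),   # e.g. MAE
    rank_by_error({"A": 0.40, "B": 0.35, "C": 0.38}),   # e.g. MBRE
    rank_by_error({"A": 0.30, "B": 0.25, "C": 0.22}),   # e.g. MIBRE
]
order, points = borda_count(voters)
```

A model that is never first but consistently near the top can win the Borda count, which is why the method is used here as a stability indicator rather than as an accuracy measure.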

IV. RESULTS AND DISCUSSION
This section presents the results of the comparisons among the online rating systems. Table 2 shows the MAE results over all datasets. The differences between the models are nearly negligible, except for the LQ model, whose error is extreme over both dense and sparse datasets. Interestingly, naïve models produce accurate results in comparison with more sophisticated models such as Bayesian and LQ. For the dense datasets (i.e. 100K and 1M) the median model produces the most accurate results, while for the sparse datasets the Dirichlet and BetaDR models are more accurate. These results agree with previous findings, since both Dirichlet and BetaDR were originally proposed to handle sparse datasets that contain very few ratings. Nevertheless, the median model still produces accuracy comparable to the Dirichlet model over all sparse datasets.

To investigate further, we ran the analysis using the MBRE and MIBRE evaluation measures. Table 3 shows the MBRE results over all datasets. As in Table 2, the accuracy results are close. In general, the Dirichlet model is the most accurate model over both dense and sparse datasets. Table 4 suggests that the median model is the most accurate model over all datasets. Taken together, these results indicate that the median and Dirichlet models are the most accurate models for both sparse and dense datasets. Based on the above analysis, we can recommend the median model because it is simpler to implement than Dirichlet, produces results comparable to Dirichlet, and performs better than many sophisticated models.

To analyze the stability of all models over all datasets, and over the sparse and dense datasets separately, we first rank all models on each dataset and each evaluation measure individually. Then we apply the Borda count method. Table 5 presents the ranking stability of all models over dense and sparse datasets. The ranking shows that BetaDR is the most stable model over all datasets, and especially over dense datasets, across the evaluation measures, whereas the Dirichlet model is the most accurate model over sparse datasets. In general, the top three models in the table (i.e. BetaDR, Dirichlet and median) are the most stable models. Surprisingly, the results suggest that BetaDR is better than both Dirichlet and median over all datasets. In contrast, sophisticated models such as Fuzzy and LQ are not accurate, as they occupy the last positions over all datasets and across all evaluation measures. The commonly used mean model occupies middle positions, with unstable ranking across the evaluation measures.

RQ1: Is there any one method that can perform stably well under all conditions?
Ans. There is no definitive answer because the differences among the models are negligible, but the median and Dirichlet models produce the most accurate results, as shown in Tables 2, 3 and 4.
RQ2: Which group of methods is more appropriate for dense datasets?
Ans. Table 5 shows that both the BetaDR and median models are the most stable and accurate models over dense datasets.
RQ3: Which group of methods is more appropriate for sparse datasets?
Ans. Similar to the previous answer, Dirichlet and BetaDR are the most accurate and stable models over sparse datasets. This is not surprising, because both models were designed to handle sparse datasets. Both models are also suitable for a new rating system that has few ratings.

V. CONCLUSIONS
An online rating system is a helpful tool that facilitates user decisions in online transactions. An accurate rating system lets users choose the correct product, which leads to better user satisfaction. Many models have been proposed in the literature, but their accuracy depends on how helpful their scores are to users. In this paper we conducted a comparative analysis of widely used online rating systems to investigate their accuracy and stability over dense and sparse datasets. Three evaluation measures, in addition to the Borda count method, were used to assess the stability and accuracy of the models. The results show that median and Dirichlet are the most accurate models over dense and sparse datasets respectively, and that BetaDR is the most stable model across all evaluation measures. The Fuzzy and LQ models performed worst. The top three ranked models, median, BetaDR and Dirichlet, all produce relatively accurate and stable results; among them we recommend the median in general because it has the simplest implementation and a low running cost. More specifically, we recommend the median model for products with many ratings and the Dirichlet model for products with few ratings.

TABLE II. MEAN ABSOLUTE ERROR RESULTS

TABLE III. MBRE RESULTS

TABLE V. RANKING STABILITY