SENTIMENT ANALYSIS OF ELECTRONIC PRODUCT TWEETS USING BIG DATA FRAMEWORK

Social media has become increasingly popular with the advancement of Internet technologies and smartphone devices. Such platforms encourage users to share their opinions. Social media platforms like Twitter also play an important role for business companies: customer opinions about a product tell companies more about customer choices. Millions of tweets are generated by people every year, but this volume of unstructured tweets cannot be handled by traditional platforms. Therefore, big data frameworks, such as Hadoop and Spark, are used to handle such large data. In this paper, different sale tweets are used to analyze the sentiments of customers regarding electronic products. The experimental results of the proposed work will be useful for business companies in making business decisions, which will further enhance product sales.


INTRODUCTION
Social media platforms, such as Twitter, Facebook and Instagram, have become vital constituents of daily life. People use these media to express their feelings, opinions, views and experiences about places or things [1]. Sentiment analysis is used to classify public opinion towards a particular topic or product. Several prominent categories of sentiment analysis have been studied, such as machine-learning [2], lexicon-based [3] and hybrid [4] approaches. Extracting information from the data available on social networks has become a growing practice. This data has huge potential and can be harnessed for business-driven applications [5], such as movie reviews [6], product advertisement, public elections [9], brand endorsement and many more.
For real-time data analysis, Twitter is a rational choice because of its large amount of relevant data, its compact tweets of up to 280 characters and the simplicity of posting an opinion. Real-time tweets are collected using hashtags (like #iphone, #OppoF9Pro). An opinion mining [7] approach is used to find the polarity of tweets as positive, negative or neutral. Knowing the collective sentiment could help companies transform their strategies [5].
The problem of sentiment analysis has been studied for many years, and the proposed solutions suffer from certain disadvantages. Persistent problems with these approaches are their centralized environment and time-consuming techniques, which consume considerable computational resources [8]. Furthermore, these standard approaches work on a limited number of tweets and cannot handle large tweet collections. Dubey et al. [9] proposed an opinion-lexical approach on the R platform to gain insight into public opinion on political diplomats. However, the proposed approach works on a small dataset of approximately 3000 tweets. To handle a large number of tweets, we therefore require distributed or parallel processing techniques, such as Spark.
Al-Saqqa et al. [10] collected a dataset of 4 million Amazon customer reviews for large-scale sentiment analysis under the Apache Spark framework. The dataset was tested with supervised machine-learning algorithms, where the model was trained using a labeled training set. Among the classification techniques applied, the support vector machine outperformed Naïve Bayes and logistic regression, attaining an accuracy of 86%.
In the age of the Internet, with such massive data, there is a need for faster computing and distributed storage, which has led to frameworks such as Apache Spark and Apache Hadoop and to techniques such as MapReduce. Spark has emerged as the most popular big data processing engine. It improves over its predecessor, Hadoop MapReduce. MapReduce provides a simple model for writing programs that can execute in parallel on a cluster. Spark improves on MapReduce in three ways. Firstly, the Spark engine can execute a more general Directed Acyclic Graph (DAG) of operators than the rigid map-then-reduce format of MapReduce. Secondly, it has a rich set of transformations, which allow the output of one operation to be fed directly into another operation. Lastly, Spark adds in-memory processing: developers can cache the data at any point in a processing pipeline, so that future operations that need the same data do not have to reload or recompute it. Spark can be launched stand-alone or on cluster managers like Hadoop YARN, Apache Mesos and Kubernetes. It can integrate with distributed storage, such as HDFS, HBase and Cassandra. It is fast and easy to use because of its high-level APIs in Java, Scala, Python and R. It has libraries such as MLlib for machine learning on big data, GraphX for graph processing, Spark SQL and Spark Streaming [11].
In this paper, we do not propose any sentiment-prediction technique; our aim is to analyze the prominent existing techniques on electronic product tweets. We perform sentiment analysis on data collected from Twitter using Flume. These tweets are classified using supervised learning approaches, such as Naïve Bayes, SVM, Decision Tree, Random Forest and Logistic Regression classifiers.
The remainder of the paper is arranged as follows. Section 2 presents related work. Section 3 covers big data processing using MapReduce, Spark and MLlib. Classification techniques are described in Section 4. In Section 5, we present the sentiment analysis framework. Section 6 presents the experimental results. Conclusion and future work are presented in Section 7.

RELATED WORK
Sentiment analysis is the investigation of people's opinions, beliefs, attitudes and emotions towards an entity, such as products, services, events, issues and topics [1]. It is a field of machine learning which has gained the attention of researchers since the beginning of the century. Miller et al. [12] introduced WordNet, an online database for English language semantic processing using synonym set (synset) relationships. SentiWordNet [13] extends WordNet as a tool for knowledge-based word-level processing by building a dictionary that assigns a score to each word. Kim and Hovy [16] operated at word granularity, starting from a few seed words and using them to create a net; they then proceeded to sentence level by combining the strengths of the words to classify people's opinions. Moreover, Wilson et al. [17] operated at phrase level, using a supervised learning approach to determine the polarity or neutrality of phrases. Furthermore, work at document granularity [18] used word frequency and part-of-speech features on Amazon reviews in categories such as books, DVDs, electronics and kitchen appliances to evaluate people's response to the products.
The Twitter streaming API was used to gather data for product sentiment analysis [3]. The aim of using Twitter data is to understand public opinion. Around 60,000 tweets were collected using the Twitter API to analyze customer opinions on widely used smartphones in Korea [21]. Kumar et al. mined people's opinions about the quality of services provided by the Airtel company [22]. For this purpose, they collected 80,000 tweets using the hashtag "#Airtel". They classified the tweets into different classes using a Naïve Bayes approach on Mahout installed over Hadoop, with an accuracy of 80.9%. They used term frequency and inverse document frequency for internal processing.
Various techniques, such as machine learning [2], entropy-based [24] and tree-kernel [25] techniques, are used for Twitter sentiment analysis. The hybrid algorithms presented in [26] for Twitter feed classification improve accuracy compared with similar techniques. To increase accuracy, word sequence disambiguation [15] and negation handling [16] could be used. In [27], the authors mined tweets with emoticons and punctuation. They concluded that the performance and accuracy of Naïve Bayes are higher than those of SVM. Emoticons and hashtags [28] are employed as sentiment labels to carry out KNN classification of diverse sentiment types. Kaur et al. [28] used Spark for processing large data. They also used a Bloom filter for testing element membership in a set and for space compaction.
Agarwal et al. [25] used a unigram model to classify Twitter data into four classes: positive, negative, neutral and junk, where junk included tweets not understood by a human annotator. They investigated tree kernel and feature-based models and reported that these models outperform the unigram baseline. They highlighted that, for feature analysis, the most prominent features combined the prior polarity of words with their part-of-speech tags. However, they used manually annotated Twitter data for testing.
Kaptein [29] studied the influence of tweets on the reputation of a company. The study explored sentiment-bearing text (i.e. subjective text) for factual information from which to derive reputational polarity. For example, a report that a Nokia smartphone exploded while charging carries negative reputational polarity for the Nokia company. The study suggested that developing a polarity lexicon for the specific domain would be cost-beneficial.
In [10], the authors retrieved 4 million reviews, which required high bulk-processing speed and distributed storage, signifying the need for big data frameworks like Hadoop and Spark. These frameworks are required to keep up with rapidly growing data generation. Many researchers are using similar frameworks for tweet analysis [30]. Baltas et al. [31] used Twitter data with the Spark platform, applying binary and ternary classification. The F-measure obtained with the logistic regression feature vector was 62.8% for positive, 59.2% for negative and 54.2% for neutral. Chan and Thein [32] performed sentiment analysis on 60k real-time tweets about the iPhone mobile product, collected using Apache Flume. Their results show that linear SVM performs better than NB by 10% and better than logistic regression by 2%.
Earlier studies have shown that the traditional approach is suitable for limited data only. A large volume of real-time tweets cannot be processed with a conventional architecture and traditional approaches. A framework with distributed processing is therefore needed to improve the accuracy and performance of the models. In this paper, we work with the Spark framework and use Flume for fast data retrieval. The results of the sentiment analyzers and their machine-learning validation are shown in tables and graphs to give a complete picture of the accuracy gained. We have not formulated any sentiment-prediction technique, but have analyzed SVM, NB, logistic regression, decision tree and random forest techniques on unstructured real-time electronic product tweets using a big data framework. Logistic regression attained an accuracy of up to 91%, outperforming all the competing techniques.

BIG DATA PROCESSING
Big data deals with large datasets which require complex processing and need huge storage. The big data frameworks used in this work are described below.

Hadoop
The Hadoop software library is an open source implementation of the MapReduce framework. It enables distributed and parallel processing of large datasets. It also provides distributed storage on a cluster of computers [33]. The Hadoop core contains MapReduce and the Hadoop Distributed File System (HDFS). HDFS is responsible for storing large datasets on the cluster; the datasets are partitioned into blocks and distributed across nodes.

MapReduce
The MapReduce model allows distributed processing across multiple nodes in a cluster. It consists of a map function and a reduce function, called mapper and reducer, respectively [34]. In the map phase, the input data is partitioned and transferred to workers that execute the map function; each worker outputs key-value pairs after processing its data. The shuffle phase sorts the output and groups it by key. The reducer is called once for every unique key and receives the set of values associated with that key. The MapReduce framework handles the underlying parallelization, fault tolerance, data distribution between nodes and load balancing. Data is replicated and distributed across nodes to improve both accessibility and reliability.
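The map, shuffle and reduce phases described above can be illustrated with a word count over a few tweets. This is a minimal single-machine sketch in plain Scala, not Hadoop code; the names `MapReduceSketch`, `mapper`, `shuffle`, `reducer` and `wordCount` are illustrative assumptions.

```scala
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in one input record.
  def mapper(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1))

  // Shuffle phase: group the emitted values by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce phase: called once per unique key with all of its values.
  def reducer(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

  // Full job: map every record, shuffle by key, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapper)).map { case (k, vs) => reducer(k, vs) }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("great phone", "bad battery great screen")))
}
```

In a real cluster the mapper calls run on different workers and the shuffle moves data over the network; here all three phases run in one process so that the data flow is easy to follow.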

Spark Framework
Apache Spark is a fast and general framework for large-scale data processing. It improves on the Hadoop framework. Hadoop is ideal for large batch processing that requires a pass through all the data. However, its performance drops quickly in certain scenarios, e.g. with graph-based or iterative algorithms. Hadoop does not cache intermediate results; instead, it flushes the data to disk between steps. In contrast, Spark has a Directed Acyclic Graph (DAG) execution engine that supports acyclic data flow and in-memory computing, so it can execute programs up to 100 times faster than Hadoop. It contains a set of libraries which combine streaming, SQL, graph processing and machine learning in a single engine. It provides high-level APIs in Python, Scala, Java and R and can run on Hadoop or standalone while using different data sources, such as HDFS, Cassandra or HBase. It provides a programming model that hides the partitioning of the dataset across the cluster, using a data structure called the Resilient Distributed Dataset (RDD) [35]. An RDD is an immutable distributed collection of records partitioned across the nodes of the cluster. The data-sharing abstraction of RDDs supports the wide range of APIs provided by Spark: MLlib, Spark Streaming, Spark SQL and GraphX (graph processing). By default, RDDs are ephemeral: each time an RDD is used in an action, it is recomputed. However, RDDs can be persisted in memory for frequent reuse.

MLlib
MLlib is Spark's distributed machine learning library. It includes fast, scalable and easy-to-use implementations of common machine learning algorithms, including classification, regression, clustering and collaborative filtering [36]. The library also has low-level primitives for convex optimization, statistical analysis tools, distributed linear algebra and feature extraction, and supports various I/O formats, such as the LIBSVM format, Spark SQL data integration and MLlib's internal format. It shows excellent performance and scalability on larger problems.

CLASSIFICATION TECHNIQUES
This section describes the sentiment analysis phases. The complete process of sentiment analysis is shown in Figure 1. The following supervised classification approaches are used to predict the polarity of a tweet.

Naïve Bayes
Naïve Bayes is a simple probabilistic classifier, which uses Bayes' theorem with an assumption of strong (naïve) independence between features. It has proven effective in many application domains, like system performance management [37], text classification, medical diagnosis and many more. It assigns the most probable class to a given instance according to its feature vector, using the posterior probability given by:
$$P(C_L \mid X) = \frac{P(X \mid C_L)\,P(C_L)}{P(X)}$$
where X = (x1, x2, …, xn) is a vector of n independent features.
C_L: one of the L possible outcomes (classes).
X: the tweet to be classified.
P(C_L | X): posterior probability of class C_L given X.
P(X | C_L): likelihood of X given class C_L.
P(C_L) and P(X): prior probabilities of the class and of the feature vector.
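As a worked step, the independence assumption lets the likelihood of the feature vector factor into per-feature terms. Since P(X) is the same for every class, it can be dropped when comparing classes. The predicted-class symbol \hat{C} is introduced here for illustration only.

```latex
P(X \mid C_L) = \prod_{i=1}^{n} P(x_i \mid C_L)
\qquad\Longrightarrow\qquad
\hat{C} = \arg\max_{C_L} \; P(C_L) \prod_{i=1}^{n} P(x_i \mid C_L)
```

In practice the product is usually evaluated as a sum of logarithms to avoid numerical underflow on long feature vectors.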

Support Vector Machine
The Support Vector Machine (SVM) carries out classification by searching for the hyperplane (the boundary dividing one entity set from another) that maximizes the margin between two classes. Hyperplanes are found using the "important training tuples" (support vectors) along with the margins [38]. SVM can be applied to both linearly and non-linearly separable datasets. As a supervised learning classifier, SVM is popular because of its high reliability, its wide range of applications and its lower susceptibility to overfitting [39].
Any hyperplane can be defined as the set of points P satisfying
$$W \cdot P - B = 0$$
where W is the normal vector to the hyperplane and $B / \lVert W \rVert$ is the offset of the hyperplane from the origin along the normal vector W. Multiple separating lines can be drawn between the classes. We have to find the "best" line (least classification error) or, in general, the best hyperplane, which is the one with the maximum distance to the closest negative instance and the closest positive instance. Figure 2 shows the SVM optimal hyperplane in training with sample tweets, classifying positive tweets (star-shaped) and negative tweets (disk-shaped).
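The geometry of a separating hyperplane can be sketched in a few lines of plain Scala. The sketch assumes the hyperplane is written as W · P − B = 0 and that the weights are already known; it does not train an SVM. All names (`HyperplaneSketch`, `signedDistance`, `classify`, `margin`) are illustrative.

```scala
object HyperplaneSketch {
  // Dot product of two vectors of equal length.
  def dot(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Euclidean norm of the normal vector.
  def norm(w: Seq[Double]): Double = math.sqrt(dot(w, w))

  // Signed distance of point p from the hyperplane W . P - B = 0.
  def signedDistance(w: Seq[Double], b: Double, p: Seq[Double]): Double =
    (dot(w, p) - b) / norm(w)

  // Label +1 (positive tweet) on the side W points to, -1 (negative tweet) otherwise.
  def classify(w: Seq[Double], b: Double, p: Seq[Double]): Int =
    if (dot(w, p) - b >= 0) 1 else -1

  // Margin of a given hyperplane: distance to the closest training point.
  def margin(w: Seq[Double], b: Double, points: Seq[Seq[Double]]): Double =
    points.map(p => math.abs(signedDistance(w, b, p))).min
}
```

Training an SVM amounts to choosing W and B so that `margin` is as large as possible while the points of the two classes stay on opposite sides.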

Decision Tree
A decision tree is a flow-chart-like structure, where each non-leaf node represents a test condition on an attribute, each branch represents an outcome of the test and each leaf node represents the class label of an entity set. The first and topmost node is the root node [25]. The tree is traversed from top to bottom, and each root-to-leaf path expresses a classification rule. It is a decision support tool that can display the outcomes of test conditions, resource costs and utility, and it can be read as an algorithm that contains only conditional control statements.
A decision tree can be converted into decision rules in the form of association rules with the target variable on the right-hand side. A decision tree can also express temporal or causal relations [40]. Figure 3 shows decision tree classification based on test conditions.
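The top-to-bottom traversal can be sketched with a small algebraic data type in plain Scala. The tree below is hand-built and hypothetical (the features `count_good` and `count_bad` are made up for illustration); it is not a tree learned from the tweet data.

```scala
object DecisionTreeSketch {
  sealed trait Node
  // Leaf node: holds the class label.
  final case class Leaf(label: String) extends Node
  // Non-leaf node: tests one attribute against a threshold.
  final case class Test(feature: String, threshold: Double, below: Node, atOrAbove: Node) extends Node

  // Walk from the root to a leaf, following the outcome of each test.
  // A feature missing from x is treated as 0.0.
  @scala.annotation.tailrec
  def classify(node: Node, x: Map[String, Double]): String = node match {
    case Leaf(label) => label
    case Test(f, t, lo, hi) => classify(if (x.getOrElse(f, 0.0) < t) lo else hi, x)
  }

  // Hypothetical tree over made-up word-count features.
  val tree: Node =
    Test("count_good", 1.0,
      Test("count_bad", 1.0, Leaf("neutral"), Leaf("negative")),
      Leaf("positive"))
}
```

Each root-to-leaf path corresponds to one decision rule, for example "if count_good < 1 and count_bad >= 1 then negative".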

Random Forest
A random forest classifier is an ensemble of tree classifiers, each generated using a random vector sampled independently from the input dataset. Each tree casts one vote for the most favourable class to classify an input vector [41]. It uses one feature or a combination of features at every node to grow a tree. Bagging is a method of building a training set by randomly drawing N examples with replacement (N is the size of the original training set). Every input instance is classified by taking the class with the most votes across all trees in the forest. We can use the Gini index as the attribute selection measure, which measures the impurity of an attribute with respect to the classes. For a given training dataset D, choosing one case at random and saying that it belongs to a class C_i, the Gini index can be written as:
$$\sum_{i} \sum_{j \neq i} \left( \frac{f(C_i, D)}{|D|} \right) \left( \frac{f(C_j, D)}{|D|} \right)$$
where $f(C_i, D) / |D|$ is the probability that the selected case belongs to class C_i.
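The Gini index can be computed directly from class proportions. The sketch below, in plain Scala, assumes the standard pairwise form (the sum over all pairs i ≠ j of p_i · p_j, where p_i is the proportion of class C_i in D) and shows that it equals the closed form 1 − Σ p_i². The names are illustrative.

```scala
object GiniSketch {
  // Class proportions f(C_i, D) / |D| for a list of class labels.
  def proportions(labels: Seq[String]): Seq[Double] =
    labels.groupBy(identity).values.map(_.size.toDouble / labels.size).toSeq

  // Pairwise form: sum over i != j of p_i * p_j.
  def giniPairwise(labels: Seq[String]): Double = {
    val p = proportions(labels)
    (for (i <- p.indices; j <- p.indices if i != j) yield p(i) * p(j)).sum
  }

  // Equivalent closed form: 1 - sum of p_i squared.
  def gini(labels: Seq[String]): Double =
    1.0 - proportions(labels).map(p => p * p).sum
}
```

A node whose tweets all share one label has a Gini index of 0, and the value grows as the classes become more evenly mixed, so a split is preferred when it lowers the index of the resulting child nodes.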

Logistic Regression
Logistic regression is a predictive classifier that is used to model a dependent variable using the logistic function. The dependent variable is a categorical value with two categories, labelled "0" and "1" (for example lose or win, sick or not sick, true or false, tea or coffee). An independent variable is a numerical or categorical value. Logistic regression is used to classify observations in terms of whether an observation belongs to a particular category or not (positive tweet or negative tweet in our problem).

Types of Logistic Regression:
• Binary Logistic Regression: models a binary outcome (yes/no).
• Nominal Logistic Regression: models a multilevel outcome which is insensitive to ordering (e.g. choice of a transport mode such as bus, car or train).
Logit (log-odds) is the logarithm of the odds of an event. If p is the probability of occurrence of an event (E = 1), then p/(1 − p) represents the corresponding odds. Logit(E) is given by:
$$\operatorname{logit}(E) = \ln\left(\frac{p}{1 - p}\right)$$
A logistic curve is obtained from the logistic function. It is a sigmoid curve whose input is any real value k ($k \in \mathbb{R}$), while the output value falls in the interval (0, 1). The logistic curve is shown in Figure 4. The logistic function p(k) is given by:
$$p(k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 k)}}$$
where p(k) is the probability of the dependent variable and $\beta_0$ is the intercept from the linear regression equation.
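The relation between the logit and the logistic function can be checked numerically. The plain-Scala sketch below assumes the logistic function has the form p(k) = 1 / (1 + e^−(β0 + β1·k)) with intercept β0 and coefficient β1; the 0.5 decision threshold and all names are illustrative assumptions.

```scala
object LogisticSketch {
  // Logit (log-odds) of a probability p in (0, 1).
  def logit(p: Double): Double = math.log(p / (1.0 - p))

  // Logistic function with intercept beta0 and coefficient beta1.
  def logistic(k: Double, beta0: Double = 0.0, beta1: Double = 1.0): Double =
    1.0 / (1.0 + math.exp(-(beta0 + beta1 * k)))

  // Binary decision: positive tweet (1) when the probability reaches the threshold.
  def predict(k: Double, beta0: Double, beta1: Double, threshold: Double = 0.5): Int =
    if (logistic(k, beta0, beta1) >= threshold) 1 else 0
}
```

With β0 = 0 and β1 = 1 the two functions are inverses of each other: applying `logit` to `logistic(k)` returns k up to rounding error, which is why the linear term β0 + β1·k can be read as the log-odds of the positive class.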

SENTIMENT ANALYSIS FRAMEWORK
We present a framework for sentiment analysis which includes data collection, pre-processing, sentiment score calculation for tweets, classification and polarity prediction.

Data Collection Using Twitter API by Flume
Twitter is a corpus of 500 million published tweets by 321 million monthly active users. This real-time data provides immense opportunities to study social trends. Data was crawled from Twitter using Flume, which connects a Flume agent to web servers. This is done with API keys obtained from a Twitter developer account. Twitter provides a REST API and a Streaming API through which client systems can consume tweets. Figure 5 shows the process of data retrieval using a Flume agent. Tweets flow from the source to a channel and then from the channel to an HDFS sink. Different hashtags are used to collect live-stream data from Twitter. The hashtags used and the tweets collected are described in Table 1. Data extracted from Twitter using the Twitter API comes in JSON format. Figure 6 is a snapshot of raw tweets in JSON format. However, raw JSON is hard for a user to read. Therefore, a JSON validator was used to validate the data and present it in a structured form. Figure 7 shows the structured tweets obtained after processing the raw tweets in JSON format.

Pre-processing of Tweets
One of the major tasks of sentiment analysis is data filtering, which helps improve the efficiency of the classifier. The pre-processing steps are as follows:
• Filtering: we eliminate useless parts of tweets, such as URL links, Twitter usernames, punctuation, hashtags, Twitter special words (such as "RT"), special characters and symbols.
• Stop words removal: some words, such as pronouns (he, she, it) and articles (a, an, the), don't give any information for classification. Moreover, keeping these words in the bag of words can lead to less accurate prediction, so it is better to eliminate them [43].
• Stemming: the conversion of different forms of a word into a single root; for example, "amuse", "amused", "amusement" and "amusing" have the same root, "amus". The result of stemming is less intuitive to humans, but more comparable across observations. Stemming decreases entropy and increases the relevance of root words like "amus" [43].
• Feature extraction: tokenization is the process of segmenting text at spaces and punctuation marks into tokens to form a bag of words. Feature transformation functions, like StringIndexer, OneHotEncoder and VectorIndexer, are used to transform categorical terms into vectors. TF-IDF is used to generate feature vectors from tweets. In TF-IDF, we compute TF (term frequency), which is the frequency of a term in a document, and IDF (inverse document frequency), which measures how infrequent a word is across all documents. TF-IDF shows the relevance of a word to a specific document. The Spark MLlib library has HashingTF and IDF algorithms to calculate TF-IDF [44].
Figure 8 shows the execution of the pre-processing steps. After completion of the data filtering steps, we get refined tweets with their labels. A sample of tweets with their polarity is shown in Figure 9.
Figure 9. Sample of tweets with labels.
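The filtering, stop-word removal, tokenization and TF-IDF steps can be sketched in plain Scala without Spark. The regular expressions, the tiny stop-word list and the smoothed IDF formula log((N + 1) / (df + 1)) are assumptions chosen for illustration; stemming is omitted, and Spark's HashingTF additionally hashes terms into a fixed-size vector, which this sketch does not do.

```scala
object PreprocessSketch {
  // Tiny illustrative stop-word list; a real list would be much longer.
  val stopWords: Set[String] = Set("a", "an", "the", "he", "she", "it", "is", "rt")

  // Filtering: drop URLs, @usernames and #hashtags, then punctuation and symbols.
  def clean(tweet: String): String =
    tweet.toLowerCase
      .replaceAll("https?://\\S+", " ")
      .replaceAll("@\\w+", " ")
      .replaceAll("#\\w+", " ")
      .replaceAll("[^a-z\\s]", " ")

  // Tokenization followed by stop-word removal.
  def tokens(tweet: String): Seq[String] =
    clean(tweet).split("\\s+").toSeq.filter(t => t.nonEmpty && !stopWords(t))

  // Term frequency: occurrences of each term in one document.
  def tf(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (t, occ) => (t, occ.size.toDouble) }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)).
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size
    docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) =>
      (t, math.log((n + 1.0) / (occ.size + 1.0)))
    }
  }

  // TF-IDF weight of every term in every document.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val weights = idf(docs)
    docs.map(d => tf(d).map { case (t, f) => (t, f * weights(t)) })
  }
}
```

A term that appears in every tweet, such as the product name, receives a weight of zero under this formula, while terms that distinguish one tweet from the others receive positive weights.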

Tweet Score Calculation
This approach uses a standard list of positive and negative words to detect the polarity of a tweet. Based on the presence of positive or negative words within a tweet, a sentiment score is generated. The polarity of a tweet, p(t), takes a value in {-1, 0, 1}, referring to a negative, neutral and positive tweet, respectively [45].
The score of a tweet, S(t), is calculated as:
$$S(t) = \sum_{i=1}^{n} p(i)$$
where p(i) is the polarity of term i in the tweet and n is the number of terms in the tweet. The polarity of a tweet is then determined as follows:
$$p(t) = \begin{cases} 1, & \text{if } S(t) > 0 \text{ (positive)} \\ 0, & \text{if } S(t) = 0 \text{ (neutral)} \\ -1, & \text{if } S(t) < 0 \text{ (negative)} \end{cases}$$
After score calculation for each tweet, we have training datasets with their polarities: positive, negative and neutral.
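The score calculation can be sketched in plain Scala as follows. The two word lists are tiny placeholders rather than the standard lexicon used in the paper, and the function names are illustrative.

```scala
object TweetScoreSketch {
  // Tiny illustrative word lists; the source uses a standard lexicon.
  val positive: Set[String] = Set("good", "great", "amazing", "love")
  val negative: Set[String] = Set("bad", "poor", "terrible", "hate")

  // Polarity p(i) of a single term: +1, -1 or 0.
  def termPolarity(term: String): Int =
    if (positive(term)) 1 else if (negative(term)) -1 else 0

  // Score S(t): sum of the term polarities in the tweet.
  def score(terms: Seq[String]): Int = terms.map(termPolarity).sum

  // Polarity p(t): the sign of the score, i.e. 1, 0 or -1.
  def polarity(terms: Seq[String]): Int = math.signum(score(terms))
}
```

For the tokens "great", "phone", "bad", "love", the term polarities are 1, 0, −1 and 1, so the score is 1 and the tweet is labelled positive.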

Model Implementation
Spark ML (the spark.ml package) is the DataFrame-based machine learning API, which became Spark's primary ML API in Spark 2.0. From the start, the Spark framework has had MLlib as an RDD-based API. Implementation in Spark follows these steps.
Firstly, data is imported into DataFrames. These are distributed collections of data organized into named columns, which make Spark programs easier and simpler to develop.
Secondly, a transformer is an algorithm that converts one DataFrame into another.
Thirdly, estimators implement the method fit(), which accepts a DataFrame and produces a model, such as logistic regression, Naïve Bayes, random forest, linear SVM or decision tree.

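import org.apache.spark.ml.classification.{LinearSVC, LogisticRegression, NaiveBayes}
// Define one estimator per run as the final pipeline stage; the commented lines are the alternatives.
val Estimator = new LinearSVC()
// val Estimator = new NaiveBayes().setLabelCol("label").setFeaturesCol("features")
// val Estimator = new LogisticRegression()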
Lastly, to combine ML algorithms into a single pipeline, we use the standardized Spark ML APIs. A Pipeline chains multiple transformers and estimators together in order to specify an ML workflow.

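import org.apache.spark.ml.Pipeline
// Chain the pre-processing stages and the chosen estimator, then fit on the training DataFrame.
val pipeline = new Pipeline().setStages(Array(labelIndexer, tokenizer, remover, hashingTF, idf, Estimator))
val model = pipeline.fit(training)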
In this classification step, to train the model, 70% of the dataset is randomly selected for training and 30% for testing.
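A 70/30 random split can be sketched in plain Scala as a seeded shuffle followed by a cut. This is only an illustration of the idea; the function name `randomSplit` and the seed parameter are assumptions, and Spark's own `randomSplit` on a DataFrame samples rows independently, so its split sizes are approximate rather than exact.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed, then cut at the requested training fraction.
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt(math.round(data.size * trainFraction).toInt)
  }
}
```

Fixing the seed makes the split reproducible, which matters when several classifiers are compared on the same training and test sets.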

RESULTS AND DISCUSSION
This section describes the details of experiments conducted on the Spark framework.

Environment Description
We conducted the experiments on the Spark framework using a single-node configuration. The machine had an Intel quad-core 3.0 GHz processor, 8 GB of RAM and 1 TB of storage, running Ubuntu 18.0. We used three datasets related to electronic products, i.e. mobile phones, laptops and televisions, containing 100K, 70K and 50K tweets, respectively.

Polarity of Datasets
This section gives a pictorial representation of polarity for the phone, laptop and television tweets. Table 3 shows the confusion matrix, which is a specific table layout that allows visualization of the effectiveness of a model.
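The evaluation measures derived from a confusion matrix can be sketched in plain Scala. The sketch assumes a binary setting with label 1 as the positive class; the case class `Confusion` and its field names are illustrative and the counts in the usage note are made up, not taken from the paper's results.

```scala
object MetricsSketch {
  // Binary confusion-matrix counts: true/false positives and negatives.
  final case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + fn + tn)
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double = tp.toDouble / (tp + fn)
    // F-measure: harmonic mean of precision and recall.
    def fMeasure: Double = 2 * precision * recall / (precision + recall)
  }

  // Build the counts from (actual, predicted) label pairs, with 1 as the positive class.
  def confusion(pairs: Seq[(Int, Int)]): Confusion = Confusion(
    tp = pairs.count { case (a, p) => a == 1 && p == 1 },
    fp = pairs.count { case (a, p) => a != 1 && p == 1 },
    fn = pairs.count { case (a, p) => a == 1 && p != 1 },
    tn = pairs.count { case (a, p) => a != 1 && p != 1 }
  )
}
```

For example, with 2 true positives, 1 false positive, 1 false negative and 4 true negatives, the accuracy is 6/8 = 0.75 and both precision and recall are 2/3. For the three-class case, the same measures can be computed per class by treating that class as positive and the other two as negative.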

Comparison of Different Machine Learning Approaches
In this subsection, we perform a series of tests using different machine learning classification approaches under the big data framework on our datasets. The comparison is carried out on several measures. Figures 13 and 14 compare the approaches in terms of training and prediction time on the different datasets.
Figure 13 shows that the Naïve Bayes classifier takes the least training time in all three categories, while the random forest classifier takes the most. It also shows a direct relation between the number of tweets and training time; i.e., as the number of tweets increases, training time also increases.
The prediction time of all approaches is compared in Figure 14. Logistic regression takes the most prediction time in all three cases, while the remaining approaches take approximately the same prediction time. Figure 15 compares the accuracy of all the approaches. It shows that logistic regression performs best, with an accuracy of 86% on the phone, 91% on the laptop and 91% on the television datasets.
Another comparison measure is AUC (area under the curve), which indicates which approach best separates the classes. The AUC values are shown in Table 7. On this measure, Figure 16 shows that both the SVM and logistic regression approaches perform well compared to the other approaches.
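AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Scala sketch below computes it that way for a binary problem; it is a direct pairwise illustration, quadratic in the number of examples, and not the routine used in the experiments. The name `auc` and the (score, label) input format are assumptions.

```scala
object AucSketch {
  // AUC as the fraction of (positive, negative) pairs in which the positive
  // example has the higher score; ties count as one half.
  def auc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

A classifier that ranks every positive tweet above every negative tweet has an AUC of 1.0, while random scoring gives a value close to 0.5.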

CONCLUSION AND FUTURE WORK
In this paper, we analyze the sentiments of different electronic product tweets. For this, real-time tweets are collected from the Twitter platform using different hashtags. Flume is used to ingest the real-time tweets into the big data framework. After pre-processing of the collected tweets, sentiment analysis is performed using different supervised classification approaches. The experimental results show that the logistic regression approach has the highest accuracy on all the datasets used. The approaches were compared on the basis of accuracy, F-measure and AUC.
Given the growth and popularity of social media platforms, such comparative results are useful for business companies. They can help identify people's sentiment towards any specific electronic product or item, and various decisions can be made based on these sentiments.
In our future work, we intend to work on multiclass approaches to identify finer-grained polarity of tweets rather than only positive, negative and neutral. In addition, we will work to enhance the accuracy of the approaches under big data technologies.
In the logistic function, $\beta_1$ is the regression coefficient multiplied by the predictor value and $e$ is the base of the exponential function.

Figure 6. Sample of raw tweets in JSON format collected from Twitter.

Figure 7. Sample of tweets in structured format.

Figures 10, 11 and 12 show the polarity of datasets indicating the ratio of positive, neutral and negative tweets, respectively.