Developing a supervised learning-based social media business sentiment index

Lee, Hyeonseo; Lee, Nakyeong; Seo, Harim; Song, Min

doi:10.1007/s11227-018-02737-x

Published: 10 January 2019

Volume 76, pages 3882–3897, (2020)

The Journal of Supercomputing


Hyeonseo Lee¹, Nakyeong Lee¹, Harim Seo¹ & Min Song¹ (ORCID: orcid.org/0000-0002-9558-8411)

Abstract

The rapid growth of digital data generation has ushered in the era of big data, and these data are particularly valuable because approximately 70% of the data collected in the world comes from social media. Thus, the investigation of online social network services is of paramount importance. In this paper, we use sentiment analysis, which detects attitudes and emotions toward social issues posted on social media, to understand the actual economic situation. To this end, we propose two steps. In the first step, we train sentiment classifiers on several large social media datasets using three types of feature sets: a feature vector, a sequence vector, and a combination of dictionary-based features and sequence vectors. We then assess the performance of six classifiers: MaxEnt-L1, C4.5 decision tree, SVM-kernel, Ada-boost, Naïve Bayes and MaxEnt. In the second step, we collect datasets relevant to several economic words that the public uses to explicitly express opinions. Finally, we use a vector auto-regression analysis to test our hypothesis. The results show a statistically significant relationship between public sentiment and economic performance. Specifically, “depression” and “unemployment” lead (Granger-cause) KOSPI. In addition, keywords extracted from the sentiment analysis, such as “price,” “year-end-tax” and “budget deficit,” Granger-cause the exchange rate.

1 Introduction

Many factors affect economic performance and financial markets. Among them, social media networks and electronic devices are rapidly growing in influence. According to reports by Pew Research Center [1] and Statista [2], the number of social media users has doubled since 2010 and is expected to increase from 0.97 billion users in 2010 to 2.95 billion users in 2020. Furthermore, the amount of data produced is expected to reach 44 zettabytes in 2020 [3]. Analytics on big data collected from online social media provide useful insights for practical applications [4]. In particular, public opinions on particular issues expressed on social networking sites have a real effect on society. Sentiment analysis captures these opinions by classifying subjective content about social issues as positive, neutral or negative. Hence, it is frequently used in areas such as financial markets and marketing, for example for sales forecasting. Zhang et al. [5] demonstrate that social media sentiment provides statistically significant information about stock prices.

In this study, we use sentiment analysis to investigate the relationship between economic performance and public opinion. To this end, we train sentiment classifiers with big data collected from various sources, including news articles, Twitter and blogs. We consider three different feature sets, which comprise a feature vector and a sequence vector built from positive and negative word dictionaries, emoticons and lexical properties of the word sequence. We then evaluate the performance of six classifiers: MaxEnt-L1, C4.5 decision tree, SVM-kernel, Ada-boost, Naïve Bayes and MaxEnt. The results show that MaxEnt-L1 performs substantially better than the other classifiers. The next step is to predict the sentiments of the collected datasets with the trained classifiers and compare the sentiment scores with an economic index. Finally, we use a VAR analysis and the Granger causality test to investigate the causal relationship between the sentiment scores and economic performance. The contribution of this paper is not to propose a new method but to analyze in depth the correlation between economic values and the time series of sentiment values of social data collected using specific keywords. The remainder of the paper consists of related work, methodology, results and conclusion.

2 Related work

2.1 Sentiment analysis

Sentiment analysis determines how sentiments are expressed in texts and whether the expressions indicate favorable or unfavorable opinions. In other words, sentiment analysis involves identifying sentiment expressions, the polarity and strength of the expressions, and their relationship to the subject [6]. As a topic of natural language processing (NLP) in computer science, sentiment analysis has been studied in both academia and industry. Its primary purpose is to figure out how people feel about something. Furthermore, many companies that have collected a lot of data on their customers and staff use sentiment analysis to assess their reputation and make business plans [7]. During the 2000s, sentiment analysis was developed in various areas. Pang and Lee [8] surveyed diverse theories and methodologies for sentiment analysis. Liu [9] noted that the major reason for the increase in sentiment analysis studies is the proliferation of social media. As noted in [8], advanced sentiment analysis methodologies and applications are required to better understand customers.

Several works have proposed classifiers specifically for sentiment analysis. According to [7], sentiment analysis based on supervised learning is the most widely accepted approach. Common classifiers include the Naïve Bayes classifier, decision tree, k-nearest neighbors, neural network, Support Vector Machine (SVM) and maximum entropy. Pang et al. [10] analyzed the performance of three classifiers (Naïve Bayes, maximum entropy and SVM) on movie reviews with rating indicators, i.e., a number of stars, which served as a baseline. They also used star ratings as polarity signals in their training datasets. Nasukawa and Yi [6] and Wilson et al. [11] classified the contextual polarity of sentiment expressions. In particular, [11] classified expressions about specific items using manually devised patterns to categorize the polarity. O’Hare et al. [12] reported that Multinomial Naïve Bayes performed better than SVM on finance-related blogs. Their study covered both ternary (i.e., positive, negative and neutral) and binary (i.e., positive and negative) sentiment classification. In classifying the sentiment of Twitter messages, [13] concluded that machine learning algorithms such as Naïve Bayes, SVM and maximum entropy achieved high accuracy (more than 80%) when trained on Twitter messages with emoticons. They applied feature extractors consisting of unigrams, bigrams, unigrams combined with bigrams, and unigrams with part-of-speech tags. However, sentiment classification is often considered domain-dependent because sentiment is expressed differently in different domains, and the same word can convey different sentiments. Therefore, [14] suggested a collaborative multi-domain sentiment classification approach that trains sentiment classifiers for multiple domains simultaneously. Specifically, they decomposed the sentiment classifier into a global one and a domain-specific one.
Fernández et al. [15] proposed the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification. Their experiments show that DCI performs well in comparison with state-of-the-art methods for cross-lingual and cross-domain sentiment classification. In addition, DCI substantially reduces the computational cost and requires less human intervention. Wang et al. [16] created a sentiment-related index (SRI) to evaluate the association between different lexical elements in a specific domain, using domain-independent features as a bridge to close the gap between domains. They then proposed a new SRI-based cross-domain sentiment classification algorithm, called SentiRelated, to analyze the sentiment polarity of short texts. Furthermore, Bader et al. [17] described an innovative approach to predicting the sentiment of documents in multiple languages without translation, using Latent Semantic Indexing (LSI) to map a multilingual corpus into a multilingual “concept space.” They designed and ran experiments that examine the extent to which subject and sentiment each contribute to classification accuracy, and thereby addressed the question of whether subject and sentiment can be sensibly separated.

2.2 Literature on the use of social media sentiment in business

Various businesses embrace text and sentiment analysis and integrate it into their processes because of its efficiency and accuracy. Manek et al. [18] suggested a statistical method that uses Gini index weighting for feature selection together with an SVM classifier for sentiment analysis of large movie review datasets. Recently, many companies have increasingly used social media such as Facebook and Twitter to interact with customers. Culnan et al. [19] described how Fortune 500 companies use four of the most popular social media platforms: Twitter, Facebook, blogs and client-hosted forums. They presented case studies of three Fortune 100 corporations to demonstrate how they managed their respective social media networks. Generally, the most common activities using social media are sales, customer care, advertising, marketing, product development and innovation [20]. He et al. [21] conducted an in-depth case study that used text mining to analyze the text content on the Facebook and Twitter sites of the three largest pizza chains: Pizza Hut, Domino’s Pizza and Papa John’s Pizza. Yu et al. [22] examined the effect of social media and traditional media, their relative importance and their correlations on short-term firm stock market performance. They applied advanced sentiment analysis techniques that go beyond counting mentions to analyze the overall daily sentiment of each media source toward a particular company.

3 Methodology

Figure 1 shows an overview of this study. The process is divided into two parts. The first part trains the sentiment classifiers and measures the sentiment classification performance of six types of classifiers. The second part identifies which economic indicators lead or lag the sentiment score from each content source. After identifying the variable pairs for which the null hypothesis is rejected at a given alpha value, we determine which variable leads and which one lags. Then, a vector autoregressive analysis is used to find the time lag between the two variables. Therefore, the Granger causality test and the vector autoregressive analysis were performed together in this study.

3.1 Collected data

To create a social media index that can be used to gauge the public economy from social media data, we sought to index consumer responses to the welfare economy based on the simple frequency of economic keywords. We collected documents for 28 query words from each medium: Twitter, blogs and news. In detail, we considered 73,229 news articles, 860,445 NAVER blogs and 9,749,893 tweets posted from January 1, 2014, to October 31, 2015. We chose the period between 2014 and 2015 because the Sewol ferry disaster occurred in 2014 and the Middle East respiratory syndrome (MERS) virus was spreading widely during 2015. Terms describing the economic situation and event-related words were used as query terms, as shown in Table 1.

Table 1 Economic terms

The data crawling process is shown in Fig. 2. When a specific query or search term is entered, the search result pages are collected. An HTML parser then generates the URL list from these pages. The web client requests each URL to gather the web pages, and the HTML parser extracts the data contents from them.
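The URL-list step of this pipeline can be sketched as follows. This is a minimal illustration using only Python's standard `html.parser` module, not the crawler used in the study; the class name `LinkCollector`, the helper `extract_urls` and the sample page are hypothetical, and the network request step is left out so the sketch stays self-contained.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags found in a search-result page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_urls(html):
    """Return every link target in the given HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.urls


# Hypothetical search-result snippet standing in for a fetched page.
sample_page = (
    '<ul><li><a href="https://example.com/a">A</a></li>'
    '<li><a href="https://example.com/b">B</a></li></ul>'
)
urls = extract_urls(sample_page)
```

Each URL in `urls` would then be requested by the web client, and the same kind of parser could be reused to pull the article text out of the returned pages.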

Since the collected data are organized as documents, they must first be split into sentences. Each sentence is then separated into tokens, which are semantic units, through tokenization, which removes whitespace and measurement strings and divides the sentence into words. Lemmatization groups multiple forms of a single word into a single form. Stop word removal eliminates words that carry little meaning, such as articles, postpositions, prepositions and conjunctions. Morpheme analysis represents the contents of words, phrases and paragraphs in the document as data that can be processed. It makes it possible to identify the morphemes of a sentence and ultimately to understand sentence structure. Part-of-speech (POS) tagging is the task of assigning a part of speech, i.e., a lexical category, to each word.
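The first preprocessing steps (sentence splitting, tokenization and stop word removal) can be illustrated with a short sketch. It assumes English text and a tiny hypothetical stop word list purely for readability; the study processed Korean text, for which lemmatization, morpheme analysis and POS tagging require a dedicated morphological analyzer that is not shown here.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def split_sentences(document):
    """Split a document into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs only.

    Whitespace, punctuation and numeric strings are dropped as a side effect.
    """
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


doc = "The price of rice is rising. Unemployment is a concern in the city!"
processed = [remove_stop_words(tokenize(s)) for s in split_sentences(doc)]
```

Here `processed` holds one token list per sentence, which is the form that the feature extraction step consumes.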

3.2 Selection of feature set

The following feature sets were fed into classifiers to predict the sentiments.

Positive and negative data were collected from various sources in various ways, then manually filtered and selected. We used positive and negative words that are in common public use. Word2vec was used to propose candidate positive and negative words, which were then manually selected. Profanity data from the Korea Creative Content Agency were also added and divided into positive and negative.

(1) Feature vector, including a Korean positive word dictionary (11,461 words), a Korean negative word dictionary (13,767 words), a curse word dictionary (3863 words), a positive emoticon dictionary (49), a negative emoticon dictionary (52) and Korean SentiWordNet (105,178 words)
(2) Sequence vector (bag of words): TokenSequence2FeatureSequence
(3) Combination of the dictionary-based feature vector and bag of words
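The three feature sets can be illustrated with a toy sketch. The word sets `POSITIVE` and `NEGATIVE` and the vocabulary below are hypothetical stand-ins for the much larger dictionaries listed above, and the functions are an assumption about how dictionary counts and bag-of-words counts might be concatenated, not the pipeline actually used in the study.

```python
from collections import Counter

# Hypothetical stand-ins for the positive and negative word dictionaries.
POSITIVE = {"good", "growth"}
NEGATIVE = {"bad", "crisis"}


def dictionary_features(tokens):
    """Feature set (1): counts of dictionary hits [positive, negative]."""
    positive_hits = sum(t in POSITIVE for t in tokens)
    negative_hits = sum(t in NEGATIVE for t in tokens)
    return [positive_hits, negative_hits]


def bag_of_words(tokens, vocabulary):
    """Feature set (2): term counts over a fixed, ordered vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def combined_features(tokens, vocabulary):
    """Feature set (3): dictionary features followed by bag-of-words counts."""
    return dictionary_features(tokens) + bag_of_words(tokens, vocabulary)


vocabulary = ["crisis", "growth", "jobs"]
features = combined_features(["growth", "jobs", "growth", "crisis"], vocabulary)
```

For the sample tokens, the first two entries of `features` are the dictionary counts and the remaining three are the vocabulary counts.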

We compiled the training dataset for the sentiment classifiers as follows. To make the classifiers domain-neutral, we first collected 11,000 tweets using the query “Seoul Mayor Election.” Then, we collected 6000 news articles using the query “living cost and job.” Finally, we collected 2,450,000 movie reviews from NAVER. Because of the sheer volume of review data, we used the customers' movie ratings as labels. Ratings are on a 0–10 scale; we treated ratings of 0–3 as negative, 4–7 as neutral and 8–10 as positive. The other datasets (tweets and news articles) were independently reviewed by three evaluators, who labeled each text as negative, neutral or positive.
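The rating thresholds used to label the movie reviews translate directly into a small mapping function. The function name `rating_to_label` is hypothetical; the cut-offs are the ones stated above.

```python
def rating_to_label(rating):
    """Map a 0-10 movie rating to a sentiment label.

    0-3 -> negative, 4-7 -> neutral, 8-10 -> positive.
    """
    if not 0 <= rating <= 10:
        raise ValueError("rating must be between 0 and 10")
    if rating <= 3:
        return "negative"
    if rating <= 7:
        return "neutral"
    return "positive"


labels = [rating_to_label(r) for r in (0, 3, 4, 7, 8, 10)]
```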

Among the 17,000 data instances, the three judges agreed on 3230 instances as positive, 5021 instances as neutral and 5410 instances as negative. The percentage of agreement is (3230 + 5021 + 5410)/17,000 = 13,661/17,000 ≈ 80.4%. We then used these 13,661 data instances and the 2.45 million movie reviews as the training data for the classifiers.
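The agreement figure is a unanimous-agreement rate: the share of instances on which all three evaluators chose the same label. The sketch below recomputes it from the counts reported above and also shows how the rate could be derived from raw annotations; the function `unanimous_agreement` and the two-item toy annotation list are hypothetical.

```python
def unanimous_agreement(annotations):
    """Fraction of items on which every annotator gave the same label.

    annotations: list of tuples, one label per annotator for each item.
    """
    agreed = sum(1 for labels in annotations if len(set(labels)) == 1)
    return agreed / len(annotations)


# Counts reported in the text: instances on which all three judges agreed.
agreed_counts = {"positive": 3230, "neutral": 5021, "negative": 5410}
total_instances = 17_000

agreed_total = sum(agreed_counts.values())       # 13661 agreed instances
agreement_rate = agreed_total / total_instances  # about 0.8036

# Toy example with raw labels from three annotators.
toy_rate = unanimous_agreement([("pos", "pos", "pos"), ("pos", "neg", "pos")])
```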

3.3 Machine learning algorithms

In this paper, we concentrate on selecting an appropriate classifier for the various feature set generation methods. Therefore, we evaluate six machine learning-based classification algorithms: MaxEnt-L1, C4.5 decision tree, SVM-kernel, Ada-boost, Naïve Bayes and MaxEnt. MaxEnt (maximum entropy) is a probabilistic classifier and a type of exponential model that finds the probability distribution of maximum entropy [23]. It is based on the principle of maximum entropy and can be applied to language detection, topic classification and sentiment analysis. Because we aim to improve on the performance of MaxEnt, we also use MaxEnt-L1. According to [24], the MaxEnt model imposes a one-to-one relationship between the subsets of variables that emerge from the parameterized factors of the model and the subsets of variables used in constraints. MaxEnt-L1, which adopts generalized expectation criteria for semi-supervised learning, has the flexibility to break this one-to-one relationship because the generalized expectation criteria are defined from a model that contains generalized expectation terms. In addition, generalized expectation criteria have advantages such as ease of use and simplicity [25]; they do not need an additional process such as building an inverted index for pre-clustering unlabeled data. For these reasons, we include MaxEnt-L1 in the evaluation. We also use the C4.5 decision tree classifier, which approximates discrete-valued functions using a decision tree and is the most popular among inductive inference algorithms [26]. As another classifier, we use Ada-boost, which is fast and simple to program [27]. In addition, Ada-boost does not require prior knowledge about the base learner, so it can be combined with any other method to find the base classifiers. We also use Naïve Bayes, which is a probabilistic classifier based on Bayes' theorem [28]. Using training data, Naïve Bayes predicts the category of a document from the cue words that occur in it. Finally, we use the SVM [29], which finds the hyperplane that separates the positive and negative subsets with the maximal margin.

We evaluate the classifiers with four measures: accuracy, recall, precision and F-measure. Accuracy is the ratio of correct classifications to all classifications. Recall is the number of instances correctly assigned to a category divided by the number of instances that actually belong to that category. Precision is the number of instances correctly assigned to a category divided by the total number of instances assigned to that category. The F-measure is the harmonic mean of precision and recall.
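The four measures can be written out in a few lines. This is a generic sketch of the standard definitions, computed per class for precision, recall and F-measure; the function names and the six-item toy labels are hypothetical and are not results from the study.

```python
def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical toy labels for illustration.
y_true = ["pos", "pos", "neg", "neu", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neu", "pos", "pos"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = per_class_metrics(y_true, y_pred, "pos")
```

In the toy data, four of six predictions are correct, and the "pos" class has two true positives, one false positive and one false negative.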

3.4 VAR analysis

In this section, we use a VAR analysis to identify the relationship between financial data, such as KOSPI and the exchange rate, and social media sentiment. Vector auto-regression (VAR) is a stochastic process model that captures the linear interdependencies among multiple time series. A VAR model describes how k variables evolve over time as a function of their past values, as follows.

A pth order VAR, which is denoted by VAR(p), is:

$$x_{t} = \alpha + A_{1} x_{t - 1} + A_{2} x_{t - 2} + \cdots + A_{p} x_{t - p} + u_{t}$$

where $x_{t-j}$ is the $j$th lag of $x$, $\alpha$ is a vector of constants, $A_{1}, \ldots, A_{p}$ are coefficient matrices, and $u_{t}$ is an error term that satisfies $E(u_{t}) = 0$, $E(u_{t} u'_{t}) = \Omega$ and $E(u_{t} u'_{t-k}) = 0$ for any nonzero $k$, where $\Omega$ is the covariance matrix of the error terms.
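To make the VAR(p) recursion concrete, the sketch below computes the deterministic part of one step, i.e. the constant plus the sum of each coefficient matrix times the corresponding lag, with the error term omitted. The function names and all numeric values (a two-variable VAR(2)) are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def var_forecast(alpha, coefs, history):
    """One-step VAR(p) forecast without the error term.

    alpha:   constant vector
    coefs:   [A_1, ..., A_p], one coefficient matrix per lag
    history: past observations, oldest first, so history[-j] is lag j
    """
    if len(history) < len(coefs):
        raise ValueError("need at least p past observations")
    forecast = list(alpha)
    for j, a_j in enumerate(coefs, start=1):
        contribution = mat_vec(a_j, history[-j])
        forecast = [f + c for f, c in zip(forecast, contribution)]
    return forecast


# Hypothetical two-variable VAR(2).
alpha = [1.0, 0.0]
A1 = [[0.5, 0.0], [0.1, 0.2]]
A2 = [[0.0, 0.25], [0.0, 0.0]]
history = [[2.0, 4.0], [1.0, 3.0]]  # lag 2, then lag 1

x_hat = var_forecast(alpha, [A1, A2], history)  # approximately [2.5, 0.7]
```

The off-diagonal entries of the coefficient matrices are what let one series help predict another, which is the property the Granger causality test examines.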

The Korea Composite Stock Price Index (KOSPI), which was first introduced in 1983 with a base value of 100, is computed from the prices of selected stocks using a weighted average. Levine and Zervos [30] find that the stock market consistently predicts economic growth. Hence, KOSPI can be used as an important indicator of economic activity.

3.5 Granger causality test

The statement that variable X Granger-causes variable Y implies that past fluctuations of X help to predict the fluctuations of Y. Granger causality and the precedence between variable X and variable Y can be determined by performing the Granger causality test with different time lags. The Granger causality test takes only two time series as input. The time lag, or delay, is set to 1, 2, 3, 4, 5 days, etc. The p value, which determines the result of the hypothesis test for each delay, can be used to estimate the relative Granger causality between the two variables. In this study, the alpha value (α) was set to 0.1, 0.05 and 0.01. After identifying the variable pairs for which the null hypothesis is rejected at a given alpha value, we first determine which variable leads and which one lags. Then, a vector autoregressive analysis is used to find the time lag between the two variables. Therefore, the Granger causality test and the vector autoregressive analysis were performed together in this study.
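The idea behind the test can be sketched in pure Python: fit a restricted regression of Y on its own lags, fit an unrestricted regression that adds the lags of X, and compare the residual sums of squares with an F statistic. This is a minimal illustration on synthetic data, not the procedure or software used in the study; the function names are hypothetical, and in practice the statistic would be converted to a p value and compared with α.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(rows, y):
    """Residual sum of squares of the least-squares fit of y on rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def granger_f(cause, effect, lags):
    """F statistic for H0: lags of `cause` do not help predict `effect`."""
    n = len(effect)
    y = effect[lags:]
    restricted = [[1.0] + [effect[t - j] for j in range(1, lags + 1)]
                  for t in range(lags, n)]
    unrestricted = [row + [cause[t - j] for j in range(1, lags + 1)]
                    for row, t in zip(restricted, range(lags, n))]
    rss_r, rss_u = rss(restricted, y), rss(unrestricted, y)
    dof = len(y) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic data: y depends on the previous value of x, not the reverse.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]

f_xy = granger_f(x, y, lags=1)  # large: x helps predict y
f_yx = granger_f(y, x, lags=1)  # small: y does not help predict x
```

Repeating the call with `lags` set to 1 through 5 mirrors the different delays described above; a large statistic in one direction only corresponds to uni-directional causality.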

4 Results

4.1 Performance results of the sentiment classification

The performance results of the sentiment classification are presented in Table 2. For all three feature sets, MaxEnt-L1 achieves the highest F1 score: 0.7351, 0.7456 and 0.9296. With the feature vector set, the MaxEnt-L1 classifier also shows the highest accuracy (0.6787). In particular, when we combine the feature vector and bag of words, recall, precision and F1 all reach their highest values with MaxEnt-L1. Overall, MaxEnt-L1 performs better than the five other classifiers.

Table 2 Sentiment classification performance results

4.2 VAR analysis

4.2.1 VAR analysis with KOSPI

KOSPI and economy-related keywords such as “boom,” “depression” and “unemployment” were selected to investigate the relationship between the financial market and the sentiment scores using a VAR analysis. The VAR model is known as a successful technique for forecasting interrelated time-dependent variables, structural inference and policy analysis. In this study, we consider four endogenous variables for the VAR analysis: KOSPI, “boom,” “depression” and “unemployment.” Furthermore, we use the Granger causality test to identify the causal relationship between KOSPI and the three keywords selected from social media.

Before the Granger causality test is applied, it is necessary to determine the optimal lag length because the Granger methodology is sensitive to the lag length. Based on the Akaike information criterion (AIC), a lag length of 5 is selected as the appropriate lag structure for the variables. The Granger causality test procedure involves estimating the following system of regressions, in which each variable depends on its own lags and the lags of the other variables.

$$Z_{t} = c + \sum\limits_{i = 1}^{5} A_{i} Z_{t - i} + \epsilon_{t} \qquad (1)$$

where $Z_{t}$ is an $n \times 1$ vector of variables. The vector of variables in the VAR is $Z_{t} = [y_{t}\ \ b_{t}\ \ d_{t}\ \ u_{t}]^{\text{T}}$, which includes KOSPI (denoted by $y_{t}$) and the extracted keywords “boom,” “depression” and “unemployment,” denoted by $b_{t}$, $d_{t}$ and $u_{t}$, respectively.

The error term satisfies $E(\epsilon_{t}) = 0$, $E(\epsilon_{t} \epsilon'_{s}) = 0$ for $s \ne t$, and

$$E\left( \epsilon_{t} \epsilon'_{t} \right) = \left[ \begin{array}{cccc} \sigma_{1}^{2} & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{12} & \sigma_{2}^{2} & \sigma_{23} & \sigma_{24} \\ \sigma_{13} & \sigma_{23} & \sigma_{3}^{2} & \sigma_{34} \\ \sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma_{4}^{2} \end{array} \right].$$

The coefficients $A_{i} = [\beta_{1i}\ \ \beta_{2i}\ \ \beta_{3i}\ \ \beta_{4i}]$, $i = 1, \ldots, 5$, are constants to be estimated. The test results can be obtained from Eq. (1).

(i) $$H_{0(1)}: \beta_{21} = \beta_{22} = \cdots = \beta_{25} = 0.$$
(ii) $$H_{0(2)}: \beta_{31} = \beta_{32} = \cdots = \beta_{35} = 0.$$
(iii) $$H_{0(3)}: \beta_{41} = \beta_{42} = \cdots = \beta_{45} = 0.$$

The null hypotheses can be interpreted as follows: (i) the keyword “boom” does not Granger-cause KOSPI, (ii) “depression” does not Granger-cause KOSPI, and (iii) “unemployment” does not Granger-cause KOSPI. If, for example, we reject null hypothesis (i), we conclude that there is causality from “boom” to KOSPI. The test results in Table 3 show that “depression” and “unemployment” lead KOSPI, whereas KOSPI leads “boom” and “unemployment.” Consequently, there is bi-directional causality in the short-run dynamics between KOSPI and “unemployment,” and uni-directional causality from “depression” to KOSPI and from KOSPI to “boom.”

Table 3 Pair-wise Granger causality tests

The outcome of the Granger causality test to determine the interaction among KOSPI, “boom,” “depression” and “unemployment” for the specified period is shown in Table 3. The results show that both null hypotheses $\beta_{31} = \beta_{32} = \cdots = \beta_{35} = 0$ and $\beta_{41} = \beta_{42} = \cdots = \beta_{45} = 0$ are rejected. Consequently, “depression” and “unemployment” lead KOSPI.

Among the parameter estimates in Table 4, “boom” at lags 1 and 3 is statistically significant at the 10% level, and “depression” at lags 2 and 4 is statistically significantly different from zero. Finally, “unemployment” at t − 1 and t − 2 has a statistically significant effect on KOSPI. Hence, the lagged values of the selected economic keywords “boom,” “depression” and “unemployment” have a significant effect on KOSPI. Furthermore, the coefficient of KOSPI at lag 3 is significantly different from zero.

Table 4 Vector auto-regression estimates

Table 5 shows the AIC and BIC values used as criteria for model selection. We prefer the model with the lowest AIC or BIC value and therefore select a lag length of five.

Table 5 AIC and BIC values

4.2.2 VAR analysis with exchange rates

In this study, we consider four endogenous variables: exchange rates, “price,” “year-end-tax” and “budget deficit.” Given the Akaike information criterion (AIC), we choose lag 2 for the optimal lag length.

The outcome of the Granger causality test to determine the interaction among the exchange rate, “price,” “year-end-tax” and “budget deficit” for the specified period is shown in Table 6. The results show that the keywords extracted from the sentiment analysis, such as “price,” “year-end-tax” and “budget deficit,” Granger-cause the exchange rate.

Table 6 Pair-wise Granger causality tests

As shown in Table 7, the estimated coefficients of “price” and “year-end-tax” at lag 2 are statistically significantly different from zero at the 10% level or better. The lagged value of the exchange rate significantly affects “price.” Therefore, Granger causality runs one-way from “price,” “year-end-tax” and “budget deficit” to the exchange rate (Table 8).

Table 7 Vector auto-regression model estimates

Table 8 AIC and BIC values

Regarding the VAR analysis of exchange rates, we prefer the second lag that minimizes both AIC and BIC values. Hence, we determine the second lag for the VAR analysis.

5 Conclusion

From an economic perspective, sentiment analysis is a notably interesting field of research. In this study, we conducted experiments using six classifiers to analyze public sentiment on social media related to several economic words, combining machine learning, statistical analysis and data on the Korean economy. We then investigated the relation between the sentiments from three types of media (i.e., news, Twitter and blogs) and actual economic indicators such as KOSPI and exchange rates by applying the Granger causality test and a vector auto-regression model. We examined whether the sentiment scores derived from large-scale datasets were correlated with the economic indices over time. The results show that MaxEnt-L1 surpasses the other classifiers, as we expected. In addition, the VAR analysis confirms that the public sentiment expressed around some economic words is related to the actual economic situation. In other words, analyzing public sentiment can yield meaningful economic forecasts or useful information for enterprises. In fact, companies that analyze and use public sentiment from social media see a stronger effect on their operations [12, 14]. Therefore, companies can expect benefits if they recognize the importance of public sentiment analysis and apply it to their marketing, customer service and operations. In future research, we plan to compare the economic indices with a more in-depth measure of public emotion and to extend the analysis beyond the few economic keywords considered here.

References

Perrin A (2015) Social media usage. Pew research center, pp 52–68
Statista, Number of social network users worldwide from 2010 to 2021 (in billions). https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
Jay Jacobs, CFA (2016) Social Media: Tech’s Growth Industry. https://www.globalxfunds.com/social-media-techs-growth-industry/
Jin S, Lin W, Yin H, Yang S, Li A, Deng B (2015) Community structure mining in big data social media networks with MapReduce. Clust Comput 18(3):999–1010
Zhang G, Xu L, Xue Y (2017) Model and forecast stock market behavior integrating investor sentiment analysis and transaction data. Clust Comput 20(1):789–803
Nasukawa T, Yi J (2003) Sentiment analysis: capturing favorability using natural language processing. In: Proceedings of the 2nd International Conference on Knowledge Capture. ACM, pp 70–77
Appel O, Chiclana F, Carter J (2015) Main concepts, state of the art and future research questions in sentiment analysis. Acta Polytech Hung 12(3):87–108
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Liu B (2012) Sentiment analysis and opinion mining. Synth Lect Hum Lang Technol 5(1):1–167
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing-Volume 10. Association for Computational Linguistics, pp 79–86
Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp 347–354
O’Hare N, Davy M, Bermingham A, Ferguson P, Sheridan P, Gurrin C, Smeaton AF (2009) Topic-dependent sentiment analysis of financial blogs. In: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion. ACM, pp 9–16
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, vol 1, no 12
Wu F, Yuan Z, Huang Y (2017) Collaboratively training sentiment classifiers for multiple domains. IEEE Trans Knowl Data Eng 29(7):1370–1383
Fernández AM, Esuli A, Sebastiani F (2016) Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. J Artif Intell Res 55(1):131–163
Wang L, Niu J, Song H, Atiquzzaman M (2018) SentiRelated: a cross-domain sentiment classification algorithm for short texts through sentiment related index. J Netw Comput Appl 101:111–119
Bader BW, Kegelmeyer WP, Chew PA (2011) Multilingual sentiment analysis using latent semantic indexing and machine learning. In: IEEE 11th International Conference on Data Mining Workshops, pp 45–52
Manek AS, Shenoy PD, Mohan MC, Venugopal KR (2017) Aspect term extraction for sentiment analysis in large movie reviews using Gini index feature selection method and SVM classifier. World Wide Web 20(2):135–154
Culnan M, McHugh P, Zubillaga J (2010) How large U.S. companies can use Twitter and other social media to gain business value. MIS Q Executive 9(4):243–259
Di Gangi PM, Wasko M, Hooker RE (2010) Getting customers’ ideas to work for you: learning from Dell how to succeed with online user innovation communities. MIS Q Executive 9(4):163–178
He W, Zha S, Li L (2013) Social media competitive analysis and text mining: a case study in the pizza industry. Int J Inf Manag 33(3):464–472
Yang Y, Duan W, Cao Q (2013) The impact of social and conventional media on firm equity value: a sentiment analysis approach. Decis Support Syst 55(4):919–926
Phillips SJ, Anderson RP, Schapire RE (2006) Maximum entropy modeling of species geographic distributions. Ecol Model 190(3):231–259
Sun CJ, Yao L, Lin L, Sha XJ, Wang XL (2011) Semi-supervised biomedical relation classification using generalized expectation criteria. In: 2011 International Conference on Machine Learning and Cybernetics (ICMLC), vol 4. IEEE, pp 1949–1952
Mann GS, McCallum A (2010) Generalized expectation criteria for semi-supervised learning with weakly labeled data. J Mach Learn Res 11:955–984
Polat K, Güneş S (2009) A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems. Expert Syst Appl 36(2):1587–1592
Schapire RE (2003) The boosting approach to machine learning: an overview. In: Nonlinear estimation and classification. Springer, New York, pp 149–171
Lewis DD (1998) Naive (Bayes) at forty: the independence assumption in information retrieval. In: European Conference on Machine Learning. Springer, Berlin, pp 4–15
Vapnik V (2013) The nature of statistical learning theory. Springer, Berlin
Levine R, Zervos S (1998) Stock markets, banks, and economic growth. Am Econ Rev 88:537–558


Acknowledgements

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2015S1A3A2046711).

Author information

Authors and Affiliations

Library and Information Science, Yonsei University, Seoul, South Korea
Hyeonseo Lee, Nakyeong Lee, Harim Seo & Min Song

Corresponding author

Correspondence to Min Song.

About this article

Cite this article

Lee, H., Lee, N., Seo, H. et al. Developing a supervised learning-based social media business sentiment index. J Supercomput 76, 3882–3897 (2020). https://doi.org/10.1007/s11227-018-02737-x

Published: 10 January 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11227-018-02737-x
