Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform

Calissano, Anna; Vantini, Simone; Arena, Marika

doi:10.1007/s10260-019-00504-7

Original Paper
Published: 17 December 2019

Statistical Methods & Applications, Volume 29, pages 787–812 (2020)

Abstract

This paper proposes a new aggregated classification scheme designed to support semantic text analysis in contexts characterized by rare text categories. The approach builds on the aggregate supervised text classifier developed by Hopkins and King and extends it with rare-event sampling methods. Specifically, it enables the analyst to enlarge the number of estimated sentiment categories while preserving estimation accuracy and reducing the working time that would otherwise be spent unconditionally increasing the size of the training set. The approach is applied to study the daily evolution of the web reputation of one of the most recent mega-events held in Europe: Expo Milano. The corpus consists of more than one million tweets, in both Italian and English, discussing the event. The analysis portrays the evolution of the Expo stakeholders’ opinions over time and allows the identification of the main drivers of the Expo reputation. The algorithm will be implemented as a running option in the next release of the R package ReadMe.

References

Agosti M, Bacchin M, Ferro N, Melucci M (2002) Improving the automatic retrieval of text documents. In: Workshop of the cross-language evaluation forum for European Languages. Springer, pp 279–290
Aprosio AP, Moretti G (2016) Italy goes to Stanford: a collection of CoreNLP modules for Italian. arXiv preprint arXiv:1609.06204
Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Ling 22(1):39–71
Blei DM (2012) Probabilistic topic models. Commun ACM 55(4):77–84
Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Bouchet-Valat M (2014) SnowballC: snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1
Breiman L, Friedman J, Stone C, Olshen R (1984) Classification and regression trees. The Wadsworth and Brooks-Cole statistics-probability series. Chapman & Hall, New York
Breslow NE (1996) Statistics in epidemiology: the case–control study. J Am Stat Assoc 91(433):14–28
Ceron A, Curini L, Iacus SM (2015) Using social media to forecast electoral results: a review of state-of-the-art. Stat Appl Ital J Appl Stat 25(3):239–261
Ceron A, Curini L, Iacus SM (2016) iSA: a fast, scalable and accurate algorithm for sentiment analysis of social media content. Inf Sci 367:105–124
Chen H, Chiang RH, Storey VC (2012) Business intelligence and analytics: from big data to big impact. MIS Q 36(4):1165–1188
Choi D, Kim P (2013) Sentiment analysis for tracking breaking events: a case study on Twitter. In: Asian conference on intelligent information and database systems. Springer, Berlin, pp 285–294
Corallo A, Fortunato L, Matera M, Alessi M, Camillò A, Chetta V, Giangreco E, Storelli D (2015) Sentiment analysis for government: an optimized approach. In: Perner P (ed) Machine learning and data mining in pattern recognition. Springer, Cham, pp 98–112
da Silva NF, Hruschka ER, Hruschka ER (2014) Tweet sentiment analysis with classifier ensembles. Decis Support Syst 66:170–179
Das SR, Chen MY (2007) Yahoo! for Amazon: sentiment extraction from small talk on the web. Manag Sci 53(9):1375–1388
Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th international conference on World Wide Web. ACM, New York, WWW ’03, pp 519–528
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Erosheva E, Fienberg S, Lafferty J (2004) Mixed-membership models of scientific publications. Proc Natl Acad Sci 101(suppl 1):5220–5227
ExpoMilano (2015) Expo Milano 2015: La sfida dell’Italia per un’Esposizione Universale innovativa. www.expo2015.org
Feinerer I, Hornik K (2017) tm: Text Mining Package. R package version 0.7-3
Gentry J (2015) twitteR: R based Twitter Client. R package version 1.1.9
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. CS224N project report, Stanford University 1(12):1–6
Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21(3):267–297
Hand DJ (2006) Classifier technology and the illusion of progress. Stat Sci 21(1):1–14
Hopkins DJ, King G (2010) A method of automated nonparametric content analysis for social science. Am J Polit Sci 54(1):229–247
Hopkins D, King G (2017) ReadMe: software for automated content analysis. R package version 0.99837
Inversini A, Marchiori E, Dedekind C, Cantoni L (2010) Applying a conceptual framework to analyze online reputation of tourism destinations. In: Gretzel U, Law R, Fuchs M (eds) Information and communication technologies in tourism 2010. Springer Vienna, Vienna, pp 321–332
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin, pp 137–142
King G, Zeng L (2001) Logistic regression in rare events data. Polit Anal 9(2):137–163
Laver M, Benoit K, Garry J (2003) Extracting policy positions from political texts using words as data. Am Polit Sci Rev 97(2):311–331
Liaw A, Wiener M (2015) Classification and regression by randomForest. CRAN repository, R package version 4.6-12
Lowe W (2008) Understanding wordscores. Polit Anal 16(4):356–371
Mahalakshmi S, Sivasankar E (2015) Cross domain sentiment analysis using different machine learning techniques. In: Ravi V, Panigrahi BK, Das S, Suganthan PN (eds) Proceedings of the fifth international conference on fuzzy and neuro computing. Springer, Cham, FANCCO-2015, pp 77–87
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Martin LW, Vanberg G (2008) A robust transformation procedure for interpreting political text. Polit Anal 16(1):93–100
Monroe BL, Maeda K (2004) Talk’s cheap: text-based estimation of rhetorical ideal-points. In: 21st annual meeting of the Society for Political Methodology, pp 29–31
Mudinas A, Zhang D, Levene M (2012) Combining lexicon and learning based approaches for concept-level sentiment analysis. In: Proceedings of the first international workshop on issues of sentiment discovery and opinion mining. ACM, New York, WISDOM ’12, pp 1–8
Mukherjee S, Bhattacharyya P (2013) Sentiment analysis: a literature survey. arXiv preprint arXiv:1304.4520
Müller M (2015) What makes an event a mega-event? Definitions and sizes. Leis Stud 34(6):627–642
Nirmala CR, Roopa GM, Kumar KRN (2015) Twitter data analysis for unemployment crisis. In: 2015 international conference on applied and theoretical computing and communication technology. Davanagere, Karnataka, India. iCATccT, pp 420–423
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing, vol 10. Association for Computational Linguistics, Stroudsburg, EMNLP ’02, pp 79–86
Ponzi LJ, Fombrun CJ, Gardberg NA (2011) Reptrak™ pulse: conceptualizing and validating a short-form measure of corporate reputation. Corp Reput Rev 14(1):15–35
Rao Y, Lei J, Wenyin L, Li Q, Chen M (2014a) Building emotional dictionary for sentiment analysis of online news. World Wide Web 17(4):723–742
Rao Y, Li Q, Mao X, Wenyin L (2014b) Sentiment topic models for social emotion mining. Inf Sci 266:90–100
Rayner J (2004) Managing reputational risk: curbing threats, leveraging opportunities. Wiley, New York
Ribeiro FN, Araújo M, Gonçalves P, André Gonçalves M, Benevenuto F (2016) Sentibench—a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci 5(1):23
Roberts ME, Stewart BM, Airoldi EM (2016) A model of text for experimentation in the social sciences. J Am Stat Assoc 111(515):988–1003
Salter-Townshend M, Murphy TB (2014) Mixtures of biased sentiment analysers. Adv Data Anal Classif 8(1):85–103
Slapin JB, Proksch SO (2008) A scaling model for estimating time-series party positions from texts. Am J Polit Sci 52(3):705–722
Solari D, Sciandra A, Rinaldo M, Redaelli M, Finos L (2016) TextWiller: collection of functions for text mining, specially devoted to the Italian language. https://github.com/livioivil/TextWiller
Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21
Stone PJ, Dexter CD, Smith MS, Ogilvie DM (1968) The general inquirer: a computer approach to content analysis. Am J Sociol 73(5):634–635
Taboada M, Brooke J, Tofiloski M, Voll K, Stede M (2011) Lexicon-based methods for sentiment analysis. Comput Ling 37(2):267–307
Tian F, Wu F, Chao KM, Zheng Q, Shah N, Lan T, Yue J (2016) A topic sentence-based instance transfer method for imbalanced sentiment classification of Chinese product reviews. Electron Commerce Res Appl 16:66–76
Tripathy A, Agrawal A, Rath SK (2016) Classification of sentiment reviews using n-gram machine learning approach. Expert Syst Appl 57:117–126
Zhao H, Ji X, Zeng Q, Jiang S (2016) A teaching evaluation method based on sentiment classification. Int J Comput Sci Math 7(1):54–62
Zhou Z, Zhang X, Sanderson M (2014) Sentiment analysis on Twitter through topic-based lexicon expansion. In: Wang H, Sharaf MA (eds) Databases theory and applications. Springer, Cham, pp 98–109

Author information

Authors and Affiliations

MOX-Department of Mathematics, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy
Anna Calissano & Simone Vantini
Department of Management, Economics and Industrial Engineering, Politecnico di Milano, Via Lambruschini, 4/B, 20156, Milan, Italy
Marika Arena

Corresponding author

Correspondence to Anna Calissano.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Proof of Theorem 1.

Proof

We prove the result of Theorem 1 for a general number of stems K and a general number of categories J. Let S be a multinomial variable taking the $2^K$ possible values $S_1,\ldots ,S_{2^K}$ and let D be a multinomial variable taking the J possible values $D_1,\ldots ,D_J$. By definition, A is a $2^K\times 2^K$ diagonal matrix and B is a $J\times J$ diagonal matrix. In matrix form, Eq. (8) can be rewritten as follows:

$$\begin{aligned}&\left[ \begin{array}{ccc} A_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} A_{2^K 2^K} \end{array}\right] \left[ \begin{array}{ccc} P^{RTr}(S=S_1|D=D_1) &{} \dots &{} P^{RTr}(S=S_1|D=D_J) \\ P^{RTr}(S=S_2|D=D_1) &{} \dots &{} P^{RTr}(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P^{RTr}(S=S_{2^K}|D=D_1) &{} \dots &{} P^{RTr}(S=S_{2^K}|D=D_J) \end{array}\right] \left[ \begin{array}{ccc} B_{1 1} &{} &{} \\ &{} \ddots &{} \\ &{} &{} B_{J J} \end{array}\right] \\&\quad = \left[ \begin{array}{ccc} P(S=S_1|D=D_1) &{} \dots &{} P(S=S_1|D=D_J) \\ P(S=S_2|D=D_1) &{} \dots &{} P(S=S_2|D=D_J) \\ \vdots &{} \ddots &{} \vdots \\ P(S=S_{2^K}|D=D_1) &{} \dots &{} P(S=S_{2^K}|D=D_J) \end{array}\right] \\ \end{aligned}$$

Because A and B are diagonal, the equality can be proved component-wise, by considering the generic component (i, j):

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$

For the sake of simplicity, we omit the (i, j) subscripts in the rest of the proof.

Consider the left-hand side of the equality, substitute $A_{ii}=P^{Tr}(S=S_i)/P^{RTr}(S=S_i)$, and apply Bayes' formula:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(S=S_i|D=D_j)B_{jj}}{P^{RTr}(S=S_i)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)P^{RTr}(S=S_i)B_{jj}}{P^{RTr}(S=S_i)P^{RTr}(D=D_j)} \end{aligned}$$

Cancelling $P^{RTr}(S=S_i)$, substituting $B_{jj}=1/\sum _{n=1}^{2^K}P^{RTr}(S=S_n|D=D_j)A_{nn}$, and then applying Bayes' formula and the definition of $A_{nn}$ inside the sum:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{P^{RTr}(S=S_n|D=D_j)A_{nn}}}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{RTr}(D=D_j|S=S_i)}{P^{RTr}(D=D_j)\sum \limits _{n=1}^{2^K}{\dfrac{P^{RTr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}{P^{RTr}(D=D_j)}}} \end{aligned}$$

Cancelling $P^{RTr}(D=D_j)$, and using the hypothesis $P^{RTr}(D|S)=P^{Tr}(D|S)$ together with the law of total probability:

$$\begin{aligned}&\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{\sum \nolimits _{n=1}^{2^K}P^{Tr}(D=D_j|S=S_n)P^{Tr}(S=S_n)}\\&\quad =\dfrac{P^{Tr}(S=S_i)P^{Tr}(D=D_j|S=S_i)}{P^{Tr}(D=D_j)}\\&\quad =P^{Tr}(S=S_i|D=D_j) \end{aligned}$$

By hypothesis:

$$\begin{aligned} P^{Tr}(S=S_i|D=D_j)=P(S=S_i|D=D_j) \end{aligned}$$

Combining the two results, we obtain:

$$\begin{aligned}{}[A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}]_{ij}=[P^{Tr}(S=S_i|D=D_j)]_{ij}=[P(S=S_i|D=D_j)]_{ij} \end{aligned}$$
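The component-wise identity can also be checked numerically. The following is a minimal sketch, not the authors' implementation: it takes $A_{ii}=P^{Tr}(S=S_i)/P^{RTr}(S=S_i)$ and $B_{jj}=1/\sum _{n}P^{RTr}(S=S_n|D=D_j)A_{nn}$ as they appear in the substitution steps of the proof, and it assumes a made-up training distribution with K = 2 stems and J = 3 categories, the third of which is rare. The resampled set is built so that the hypothesis $P^{RTr}(D|S)=P^{Tr}(D|S)$ holds by construction. All names and numbers are hypothetical.

```python
from math import isclose

# Hypothetical joint distribution P^{Tr}(S=S_i, D=D_j) in the training set:
# rows are the 2^K = 4 stem profiles (K = 2), columns are J = 3 categories.
# The third category is rare (total mass 0.05). All numbers are made up.
JOINT_TR = [
    [0.30, 0.05, 0.01],
    [0.20, 0.10, 0.01],
    [0.10, 0.10, 0.02],
    [0.05, 0.05, 0.01],
]

# Hypothetical stem-profile marginal P^{RTr}(S=S_i) after resampling.
P_S_RTR = [0.10, 0.20, 0.40, 0.30]


def conditional_s_given_d(joint):
    """Return P(S=S_i | D=D_j) from a joint table indexed [i][j]."""
    p_d = [sum(col) for col in zip(*joint)]
    return [[x / pd for x, pd in zip(row, p_d)] for row in joint]


def resample_joint(joint_tr, p_s_rtr):
    """Build P^{RTr}(S, D) with the same P(D|S) as the training set."""
    p_s_tr = [sum(row) for row in joint_tr]
    return [
        [x / ps * q for x in row]
        for row, ps, q in zip(joint_tr, p_s_tr, p_s_rtr)
    ]


def corrected_conditional(joint_tr, p_s_rtr):
    """Compute the entries A_ii * P^{RTr}(S_i|D_j) * B_jj."""
    p_s_tr = [sum(row) for row in joint_tr]
    cond_rtr = conditional_s_given_d(resample_joint(joint_tr, p_s_rtr))
    # Diagonal of A: ratio of training to resampled stem-profile marginals.
    a = [t / r for t, r in zip(p_s_tr, p_s_rtr)]
    n_cat = len(joint_tr[0])
    # Diagonal of B: column-wise normalising constants.
    b = [
        1.0 / sum(a[n] * cond_rtr[n][j] for n in range(len(a)))
        for j in range(n_cat)
    ]
    return [
        [a[i] * cond_rtr[i][j] * b[j] for j in range(n_cat)]
        for i in range(len(a))
    ]


corrected = corrected_conditional(JOINT_TR, P_S_RTR)
target = conditional_s_given_d(JOINT_TR)  # P^{Tr}(S|D)
max_gap = max(
    abs(c - t) for rc, rt in zip(corrected, target) for c, t in zip(rc, rt)
)
identity_holds = all(
    isclose(c, t) for rc, rt in zip(corrected, target) for c, t in zip(rc, rt)
)
print(f"largest entrywise gap: {max_gap:.2e}, identity holds: {identity_holds}")
```

The check covers only the first equality, $A_{ii}P^{RTr}(S=S_i|D=D_j)B_{jj}=P^{Tr}(S=S_i|D=D_j)$. The second equality, $P^{Tr}(S|D)=P(S|D)$, is a hypothesis about the data and cannot be verified from the training set alone.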

Appendix 2

Lists of the sentiment and opinion categories defined for the analysis of the web reputation of Expo Milano (Tables 1, 2).

Table 1 Sentiment analysis: description of the sentiment categories

Table 2 Opinion analysis: description of the positive (negative) opinion categories. Every positive category has a corresponding negative one. Neutral, Off-Topics, and Advertise categories are also estimated in opinion analysis

About this article

Cite this article

Calissano, A., Vantini, S. & Arena, M. Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform. Stat Methods Appl 29, 787–812 (2020). https://doi.org/10.1007/s10260-019-00504-7

Accepted: 08 December 2019
Published: 17 December 2019
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10260-019-00504-7
