Abstract
This paper proposes a new aggregated classification scheme designed to support the implementation of semantic text analysis methods in contexts characterized by the presence of rare text categories. The proposed approach starts from the aggregate supervised text classifier developed by Hopkins and King and extends it by relying on rare-event sampling methods. Specifically, it enables the analyst to enlarge the number of estimated sentiment categories while preserving estimation accuracy and reducing the working time otherwise needed to increase the size of the training set unconditionally. The approach is applied to study the daily evolution of the web reputation of one of the most recent mega-events held in Europe: Expo Milano. The corpus consists of more than one million tweets, in both Italian and English, discussing the event. The analysis portrays the evolution of the Expo stakeholders’ opinions over time and allows the identification of the main drivers of the Expo reputation. The algorithm will be implemented as a running option in the next release of the R package ReadMe.
References
Agosti M, Bacchin M, Ferro N, Melucci M (2002) Improving the automatic retrieval of text documents. In: Workshop of the cross-language evaluation forum for European Languages. Springer, pp 279–290
Aprosio AP, Moretti G (2016) Italy goes to Stanford: a collection of CoreNLP modules for Italian. arXiv preprint arXiv:1609.06204
Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Ling 22(1):39–71
Blei DM (2012) Probabilistic topic models. Commun ACM 55(4):77–84
Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Bouchet-Valat M (2014) SnowballC: snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1
Breiman L, Friedman J, Stone C, Olshen R (1984) Classification and regression trees. The Wadsworth and Brooks-Cole statistics-probability series. Chapman & Hall, New York
Breslow NE (1996) Statistics in epidemiology: the case–control study. J Am Stat Assoc 91(433):14–28
Ceron A, Curini L, Iacus SM (2015) Using social media to forecast electoral results: a review of state-of-the-art. Stat Appl Ital J Appl Stat 25(3):239–261
Ceron A, Curini L, Iacus SM (2016) isa: a fast, scalable and accurate algorithm for sentiment analysis of social media content. Inf Sci 367:105–124
Chen H, Chiang RH, Storey VC (2012) Business intelligence and analytics: from big data to big impact. MIS Q 36(4):1165–1188
Choi D, Kim P (2013) Sentiment analysis for tracking breaking events: a case study on Twitter. In: Asian conference on intelligent information and database systems. Springer, Berlin, pp 285–294
Corallo A, Fortunato L, Matera M, Alessi M, Camillò A, Chetta V, Giangreco E, Storelli D (2015) Sentiment analysis for government: an optimized approach. In: Perner P (ed) Machine learning and data mining in pattern recognition. Springer, Cham, pp 98–112
da Silva NF, Hruschka ER, Hruschka ER (2014) Tweet sentiment analysis with classifier ensembles. Decis Support Syst 66:170–179
Das SR, Chen MY (2007) Yahoo! for Amazon: sentiment extraction from small talk on the web. Manag Sci 53(9):1375–1388
Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th international conference on World Wide Web. ACM, New York, WWW ’03, pp 519–528
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Erosheva E, Fienberg S, Lafferty J (2004) Mixed-membership models of scientific publications. Proc Natl Acad Sci 101(suppl 1):5220–5227
ExpoMilano (2015) Expo Milano 2015: La sfida dell’italia per un’esplosione universale innovativa. www.expo2015.org
Feinerer I, Hornik K (2017) tm: Text Mining Package. R package version 0.7-3
Gentry J (2015) twitteR: R based Twitter Client. R package version 1.1.9
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. Nature 1(12):1–6
Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21(3):267–297
Hand DJ (2006) Classifier technology and the illusion of progress. Stat Sci 21(1):1–14
Hopkins DJ, King G (2010) A method of automated nonparametric content analysis for social science. Am J Polit Sci 54(1):229–247
Hopkins D, King G (2017) ReadMe: software for automated content analysis. R package version 0.99837
Inversini A, Marchiori E, Dedekind C, Cantoni L (2010) Applying a conceptual framework to analyze online reputation of tourism destinations. In: Gretzel U, Law R, Fuchs M (eds) Information and communication technologies in tourism 2010. Springer Vienna, Vienna, pp 321–332
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C (eds) Machine learning: ECML-98. Springer, Berlin, pp 137–142
King G, Zeng L (2001) Logistic regression in rare events data. Polit Anal 9(2):137–163
Laver M, Benoit K, Garry J (2003) Extracting policy positions from political texts using words as data. Am Polit Sci Rev 97(2):311–331
Liaw A, Wiener M (2015) Classification and regression by randomforest. R Cran Repository R package version 4.6-12
Lowe W (2008) Understanding wordscores. Polit Anal 16(4):356–371
Mahalakshmi S, Sivasankar E (2015) Cross domain sentiment analysis using different machine learning techniques. In: Ravi V, Panigrahi BK, Das S, Suganthan PN (eds) Proceedings of the fifth international conference on fuzzy and neuro computing. Springer, Cham, FANCCO-2015, pp 77–87
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Martin LW, Vanberg G (2008) A robust transformation procedure for interpreting political text. Polit Anal 16(1):93–100
Monroe BL, Maeda K (2004) Talk’s cheap: text-based estimation of rhetorical ideal-points. In: 21st annual meeting of the Society for Political Methodology, pp 29–31
Mudinas A, Zhang D, Levene M (2012) Combining lexicon and learning based approaches for concept-level sentiment analysis. In: Proceedings of the first international workshop on issues of sentiment discovery and opinion mining. ACM, New York, WISDOM ’12, pp 1–8
Mukherjee S, Bhattacharyya P (2013) Sentiment analysis: a literature survey. arXiv preprint arXiv:1304.4520
Müller M (2015) What makes an event a mega-event? Definitions and sizes. Leis Stud 34(6):627–642
Nirmala CR, Roopa GM, Kumar KRN (2015) Twitter data analysis for unemployment crisis. In: 2015 international conference on applied and theoretical computing and communication technology. Davanagere, Karnataka, India. iCATccT, pp 420–423
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up?: Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on empirical methods in natural language processing, vol 10. Association for Computational Linguistics, Stroudsburg, EMNLP ’02, pp 79–86
Ponzi LJ, Fombrun CJ, Gardberg NA (2011) Reptrak™ pulse: conceptualizing and validating a short-form measure of corporate reputation. Corp Reput Rev 14(1):15–35
Rao Y, Lei J, Wenyin L, Li Q, Chen M (2014a) Building emotional dictionary for sentiment analysis of online news. World Wide Web 17(4):723–742
Rao Y, Li Q, Mao X, Wenyin L (2014b) Sentiment topic models for social emotion mining. Inf Sci 266:90–100
Rayner J (2004) Managing reputational risk: curbing threats, leveraging opportunities. Wiley, New York
Ribeiro FN, Araújo M, Gonçalves P, André Gonçalves M, Benevenuto F (2016) Sentibench—a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Sci 5(1):23
Roberts ME, Stewart BM, Airoldi EM (2016) A model of text for experimentation in the social sciences. J Am Stat Assoc 111(515):988–1003
Salter-Townshend M, Murphy TB (2014) Mixtures of biased sentiment analysers. Adv Data Anal Classif 8(1):85–103
Slapin JB, Proksch SO (2008) A scaling model for estimating time-series party positions from texts. Am J Polit Sci 52(3):705–722
Solari D, Sciandra A, Rinaldo M, Redaelli M, Finos L (2016) TextWiller: collection of functions for text mining, specially devoted to the Italian language. https://github.com/livioivil/TextWiller
Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21
Stone PJ, Dexter CD, Smith MS, Ogilvie DM (1968) The general inquirer: a computer approach to content analysis. Am J Sociol 73(5):634–635
Taboada M, Brooke J, Tofiloski M, Voll K, Stede M (2011) Lexicon-based methods for sentiment analysis. Comput Ling 37(2):267–307
Tian F, Wu F, Chao KM, Zheng Q, Shah N, Lan T, Yue J (2016) A topic sentence-based instance transfer method for imbalanced sentiment classification of Chinese product reviews. Electron Commerce Res Appl 16:66–76
Tripathy A, Agrawal A, Rath SK (2016) Classification of sentiment reviews using n-gram machine learning approach. Expert Syst Appl 57:117–126
Zhao H, Ji X, Zeng Q, Jiang S (2016) A teaching evaluation method based on sentiment classification. Int J Comput Sci Math 7(1):54–62
Zhou Z, Zhang X, Sanderson M (2014) Sentiment analysis on twitter through topic-based lexicon expansion. In: Wang H, Sharaf MA (eds) Databases theory and applications. Springer, Cham, pp 98–109
Appendices
Appendix 1
Proof of Theorem 1.
Proof
The result of Theorem 1 is proved for a general number of stems K and a general number of categories J. Let S be a multinomial variable taking the values \(S_1,\ldots ,S_{2^K}\) and let D be a multinomial variable taking the values \(D_1,\ldots ,D_J\). By definition, A is a \(2^K\times 2^K\) diagonal matrix and B is a \(J\times J\) diagonal matrix. In matrix terms, Eq. (8) can be rewritten as follows:
Because of the structure of the matrices, the equality can be proved component-wise by considering the generic component (i, j):
For the sake of simplicity, we omit the (i, j) subscripts throughout the proof.
Consider the left side of the equality, substitute \(A_{ii}\), and apply Bayes' formula:
Substituting \(B_{jj}\):
Using the hypothesis \(P^{RTr}(D|S)=P^{Tr}(D|S)\) and the law of total probability:
By hypothesis:
We can therefore write:
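The component-wise argument can be checked numerically. The sketch below is only an illustration under assumed definitions that are not taken from the paper: it sets \(A_{ii}=P^{Tr}(S_i)/P^{RTr}(S_i)\) and \(B_{jj}=P^{RTr}(D_j)/P^{Tr}(D_j)\), and it builds a resampled training set by reweighting the stem profiles, which leaves \(P(D|S)\) unchanged as the hypothesis requires. Under these assumptions, Bayes' formula gives \(P^{Tr}(S_i|D_j)=A_{ii}\,P^{RTr}(S_i|D_j)\,B_{jj}\). The joint distribution and the weights are hypothetical toy values.

```python
# Numerical check of the component-wise identity under ASSUMED definitions
# (not taken from the paper):
#   A_ii = P_Tr(S_i) / P_RTr(S_i)   and   B_jj = P_RTr(D_j) / P_Tr(D_j).

# Hypothetical joint distribution P_Tr(S_i, D_j): rows are the 2^K = 4 stem
# profiles (K = 2), columns are the J = 3 categories. Entries sum to 1.
P_TR = [
    [0.20, 0.05, 0.01],
    [0.10, 0.15, 0.02],
    [0.05, 0.10, 0.02],
    [0.15, 0.10, 0.05],
]

# Hypothetical resampling weights, one per stem profile.
ROW_WEIGHTS = [1.0, 2.0, 0.5, 3.0]


def reweight_rows(joint, weights):
    """Scale each row S_i by its weight and renormalise; P(D | S) is preserved."""
    scaled = [[w * p for p in row] for w, row in zip(weights, joint)]
    total = sum(sum(row) for row in scaled)
    return [[p / total for p in row] for row in scaled]


def row_marginal(joint):
    """Marginal distribution P(S_i)."""
    return [sum(row) for row in joint]


def col_marginal(joint):
    """Marginal distribution P(D_j)."""
    return [sum(col) for col in zip(*joint)]


def cond_s_given_d(joint):
    """Matrix of P(S_i | D_j): each column is divided by its total."""
    cols = col_marginal(joint)
    return [[p / cols[j] for j, p in enumerate(row)] for row in joint]


def cond_d_given_s(joint):
    """Matrix of P(D_j | S_i): each row is divided by its total."""
    return [[p / sum(row) for p in row] for row in joint]


def max_abs_diff(m1, m2):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


# Resampled training set (RTr) and the assumed diagonal entries of A and B.
P_RTR = reweight_rows(P_TR, ROW_WEIGHTS)
A_DIAG = [t / r for t, r in zip(row_marginal(P_TR), row_marginal(P_RTR))]
B_DIAG = [r / t for t, r in zip(col_marginal(P_TR), col_marginal(P_RTR))]

# Left side: P_Tr(S | D). Right side: A * P_RTr(S | D) * B, entry by entry.
LHS = cond_s_given_d(P_TR)
RHS = [
    [A_DIAG[i] * p * B_DIAG[j] for j, p in enumerate(row)]
    for i, row in enumerate(cond_s_given_d(P_RTR))
]

if __name__ == "__main__":
    print("max |LHS - RHS| =", max_abs_diff(LHS, RHS))
```

Without the correction, \(P^{RTr}(S|D)\) differs visibly from \(P^{Tr}(S|D)\) in this toy example, so the two diagonal rescalings are what restore the original conditional distribution.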
Appendix 2
List of opinion and sentiment categories defined for the analysis of web-reputation of Expo Milan (Tables 1, 2).
About this article
Cite this article
Calissano, A., Vantini, S. & Arena, M. Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform. Stat Methods Appl 29, 787–812 (2020). https://doi.org/10.1007/s10260-019-00504-7
DOI: https://doi.org/10.1007/s10260-019-00504-7