1 Introduction

Sentiment analysis is “the task of identifying positive and negative opinions, emotions, and evaluations” (Wilson et al. 2005). Since its inception, sentiment analysis has been the subject of intensive research and has been successfully applied, e.g., to assist users in their development by providing them with interesting and supportive content (Honkela et al. 2012), to predict the outcome of an election (Tumasjan et al. 2010) or to predict movie sales (Mishne and Glance 2006). The spectrum of sentiment analysis techniques ranges from identifying polarity (positive or negative) to a complex computational treatment of subjectivity, opinion and sentiment (Pang and Lee 2007). In particular, the research on sentiment polarity analysis has resulted in a number of mature and publicly available tools such as SentiStrength (Thelwall et al. 2010), Alchemy,Footnote 1 the Stanford NLP sentiment analyser (Socher et al. 2013) and NLTK (Bird et al. 2009).

In recent times, large-scale software development has become increasingly social. With the proliferation of collaborative development environments, discussions between developers are recorded and archived to an extent that could not be conceived before. The availability of such discussion materials makes it easy to study whether and how the sentiments expressed by software developers influence the outcome of development activities. With this background, we apply sentiment polarity analysis to several software development ecosystems in this study.

Sentiment polarity analysis has recently been applied in the software engineering context to study commit comments in GitHub (Guzman et al. 2014), GitHub discussions related to security (Pletea et al. 2014), productivity in Jira issue resolution (Ortu et al. 2015), activity of contributors in Gentoo (Garcia et al. 2013), classification of user reviews for maintenance and evolution (Panichella et al. 2015) and evolution of developers’ sentiments in the openSUSE Factory (Rousinopoulos et al. 2014). It has also been suggested as a means of assessing technical candidates on the social web (Capiluppi et al. 2013). Not surprisingly, all the aforementioned software engineering studies, with the notable exception of the work by Panichella et al. (2015), reuse existing sentiment polarity tools, e.g., (Pletea et al. 2014) and (Rousinopoulos et al. 2014) use NLTK, while (Garcia et al. 2013; Guzman and Bruegge 2013; Guzman et al. 2014; Novielli et al. 2015) and (Ortu et al. 2015) opted for SentiStrength. While the reuse of existing tools facilitated the application of sentiment polarity analysis techniques in the software engineering domain, it also introduced a commonly recognized threat to the validity of the results obtained: those tools have been trained on non-software engineering related texts such as movie reviews or product reviews and might misidentify (or fail to identify) the polarity of a sentiment in a software engineering artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).

Therefore, in this paper we focus on sentiment polarity analysis (Wilson et al. 2005) and investigate to what extent software engineering results obtained from sentiment analysis depend on the choice of the sentiment analysis tool. We recognize that there are multiple ways to measure outcomes in software engineering. Among them, the time to resolve a particular defect and/or to respond to a particular query is relevant for end users. Accordingly, in the different data-sets studied in this paper, we have taken such resolution or response times to reflect the outcomes of our interest.

For the sake of simplicity, from here on, instead of “existing sentiment polarity analysis tools” we talk about the “sentiment analysis tools”. Specifically, we aim at answering the following questions:

  • RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?

  • RQ2: To what extent do results from different sentiment analysis tools agree with each other?

We have observed disagreement between sentiment analysis tools and the emotions of software developers but also between different sentiment analysis tools themselves. However, disagreement between the tools does not a priori mean that sentiment analysis tools might lead to contradictory results in software engineering studies making use of these tools. Thus, we ask

  • RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?

We have observed that disagreement between the tools might lead to contradictory results in software engineering studies. Therefore, we finally conduct replication studies in order to understand:

  • RQ4: How does the choice of a sentiment analysis tool affect validity of the previously published results?

The remainder of this paper is organized as follows. The next section outlines the sentiment analysis tools we have considered in this study. In Section 3 we study agreement between the tools and the results of manual labeling, and between the tools themselves, i.e., RQ1 and RQ2. In Section 4 we conduct a series of studies based on the results of different sentiment analysis tools. We observe that conclusions one might derive using different tools diverge, casting doubt on their validity (RQ3). While our answer to RQ3 indicates that the choice of a sentiment analysis tool might affect the validity of software engineering results, in Section 5 we perform replications of two published studies, answering RQ4 and establishing that conclusions of previously published works cannot be reproduced when a different sentiment analysis tool is used. Finally, in Section 6 we discuss related work and conclude in Section 7.

Source code and data used to obtain the results of this paper have been made available.Footnote 2

2 Sentiment Analysis Tools

2.1 Tool Selection

To perform the tool evaluation we have decided to focus on open-source tools. This requirement excludes such commercial tools as Lymbix,Footnote 3 the Sentiment API of MeaningCloudFootnote 4 or GetSentiment.Footnote 5 Furthermore, we exclude tools that require training before they can be applied, such as LibShortText (Yu et al. 2013) or the sentiment analysis libraries of popular machine learning tools such as RapidMiner or Weka. Finally, since the software engineering texts that have been analyzed in the past can be quite short (JIRA issues, Stack Overflow questions), we have chosen tools that have already been applied either to software engineering texts (SentiStrength and NLTK) or to short texts such as tweets (Alchemy or the Stanford NLP sentiment analyser).

2.2 Description of Tools

2.2.1 SentiStrength

SentiStrength is the sentiment analysis tool most frequently used in software engineering studies (Garcia et al. 2013; Guzman et al. 2014; Novielli et al. 2015; Ortu et al. 2015). Moreover, SentiStrength had the highest average accuracy among fifteen Twitter sentiment analysis tools (Abbasi et al. 2014). SentiStrength assigns an integer value between 1 and 5 for the positivity of a text, denoted p, and similarly, a value between −1 and −5 for the negativity, denoted n.

Interpretation

In order to map the separate positivity and negativity scores to a sentiment (positive, neutral or negative) for an entire text fragment, we follow the approach by Thelwall et al. (Thelwall et al. 2012). A text is considered positive when p + n > 0, negative when p + n < 0, and neutral if p = −n and p < 4. Texts with a score of p = −n and p ≥ 4 are considered to have an undetermined sentiment and are removed from the datasets.
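As an illustration, this interpretation rule can be written as a small helper (a minimal sketch; the function name is ours and is not part of SentiStrength or of our replication package):

```python
def interpret_sentistrength(p, n):
    """Map SentiStrength positivity p (1..5) and negativity n (-5..-1) to a label."""
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    # p == -n: weak opposing scores are neutral, strong opposing scores are undetermined
    return "neutral" if p < 4 else "undetermined"  # undetermined texts are removed
```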

2.2.2 Alchemy

Alchemy provides several text processing APIs, including a sentiment analysis API which promises to work on very short texts (e.g., tweets) as well as relatively long texts (e.g., news articles).Footnote 6 The sentiment analysis API returns for a text fragment a status, a language, a score and a type. The score is in the range [−1,1], the type is the sentiment of the text and is based on the score. For negative scores, the type is negative, conversely for positive scores, the type is positive. For a score of 0, the type is neutral. The status reflects the analysis success and it is either “OK” or “ERROR”.

Interpretation

We ignore texts with status “ERROR” or a non-English language. For the remaining texts we consider them as being negative, neutral or positive as indicated by the returned type.

2.2.3 NLTK

NLTK has been applied in earlier software engineering studies (Pletea et al. 2014; Rousinopoulos et al. 2014). NLTK uses a simple bag of words model and returns for each text three probabilities: a probability of the text being negative, one of it being neutral and one of it being positive. To call NLTK, we use the API provided at text-processing.com.Footnote 7

Interpretation

If the probability score for neutral is greater than 0.5, the text is considered neutral. Otherwise, it is considered to be the other sentiment with the highest probability (Pletea et al. 2014).
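A minimal sketch of this interpretation, assuming the three probabilities returned by the text-processing.com API are available as a dictionary (the helper name and the tie-breaking towards positive are our own choices):

```python
def interpret_nltk(probs):
    """probs: {'neg': float, 'neutral': float, 'pos': float}, summing to roughly 1."""
    if probs["neutral"] > 0.5:
        return "neutral"
    # otherwise pick the non-neutral sentiment with the highest probability
    return "positive" if probs["pos"] >= probs["neg"] else "negative"
```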

2.2.4 Stanford NLP

The Stanford NLP parses the text into sentences and performs a more advanced grammatical analysis, as opposed to the simpler bag of words model used, e.g., in NLTK. Indeed, Socher et al. argue that such an analysis should outperform the bag of words model on short texts (Socher et al. 2013). The Stanford NLP breaks down the text into sentences and assigns each sentence a sentiment score in the range [0,4], where 0 is very negative, 2 is neutral and 4 is very positive. We note that the tool may have difficulty breaking the text into sentences as comments sometimes include pieces of code or, e.g., URLs. The tool does not provide a document-level score.

Interpretation

To determine a document-level sentiment we compute −2·#0 − #1 + #3 + 2·#4, where #i denotes the number of sentences with score i. If this score is negative, zero or positive, we consider the text to be negative, neutral or positive, respectively.
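A minimal sketch of this aggregation (the helper name is ours):

```python
def interpret_stanford(sentence_scores):
    """sentence_scores: list of per-sentence scores in 0..4 returned by Stanford NLP."""
    weights = {0: -2, 1: -1, 2: 0, 3: 1, 4: 2}
    total = sum(weights[s] for s in sentence_scores)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```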

3 Agreement Between Sentiment Analysis Tools

In this section we address RQ1 and RQ2, i.e., to what extent the different sentiment analysis tools described earlier agree with the emotions of software developers and to what extent different sentiment analysis tools agree with each other. To perform the evaluation we use the manually labeled emotions dataset (Murgia et al. 2014).

3.1 Methodology

3.1.1 Manually-Labeled Software Engineering Data

As the “golden set” we use the data from a developer emotions study by Murgia et al. (2014). In this study, four evaluators manually labeled 392 comments with the emotions “joy”, “love”, “surprise”, “anger”, “sadness” or “fear”. Emotions “joy” and “love” are taken as indicators of positive sentiment, and “anger”, “sadness” and “fear” as indicators of negative sentiment. We exclude information about the “surprise” emotion, since surprises can be, in general, both positive and negative depending on the expectations of the speaker.

We focus on consistently labeled comments. We consider a comment as positive if at least three evaluators have indicated a positive sentiment and no evaluator has indicated a negative sentiment. Similarly, we consider a comment as negative if at least three evaluators have indicated a negative sentiment and no evaluator has indicated a positive sentiment. Finally, a comment is considered neutral when three or more evaluators have indicated neither a positive nor a negative sentiment.
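These rules can be sketched as follows (our own illustrative code, not the scripts of Murgia et al.; each comment is represented by the list of emotion sets assigned by the four evaluators, and “surprise” is ignored as explained above):

```python
POSITIVE = {"joy", "love"}
NEGATIVE = {"anger", "sadness", "fear"}

def consistent_label(evaluations):
    """evaluations: per-evaluator sets of emotions, e.g. [{"joy"}, {"love"}, set(), {"joy"}]."""
    pos = sum(1 for e in evaluations if e & POSITIVE)
    neg = sum(1 for e in evaluations if e & NEGATIVE)
    if pos >= 3 and neg == 0:
        return "positive"
    if neg >= 3 and pos == 0:
        return "negative"
    if sum(1 for e in evaluations if not (e & (POSITIVE | NEGATIVE))) >= 3:
        return "neutral"
    return None  # contradictory labeling, excluded from the golden set
```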

Using these rules we can conclude that 265 comments have been labeled consistently: 19 negative, 41 positive and 205 neutral. The remaining 392 − 265 = 127 comments from the study of Murgia et al. (2014) have received contradictory labels, e.g., “fear” from one evaluator and “joy” from another.

3.1.2 Evaluation Metrics

Since more than 77 % of the comments have been manually labeled as neutral, i.e., the dataset is unbalanced, traditional metrics such as accuracy might be misleading (Batista et al. 2000): indeed, the accuracy of a straw-man sentiment analysis predicting “neutral” for every comment can easily be higher than that of any of the four tools. Therefore, rather than reporting the accuracy of the approaches we use the weighted kappa (Cohen 1968) and the Adjusted Rand Index (ARI) (Hubert and Arabie 1985; Santos and Embrechts 2009). For the sake of completeness we report the F-measures for the three categories of sentiments.

Kappa is a measure of interrater agreement. As recommended by Bakeman and Gottman (Bakeman and Gottman 1997, p. 66) we opt for the weighted kappa (κ) since the sentiments can be seen as ordered, from positive through neutral to negative, and disagreement between positive and negative is more “severe” than between positive and neutral or negative and neutral. Our weighting scheme, also following the guidelines of Bakeman and Gottman, is shown in Table 1. We follow the interpretation of κ as advocated by Viera and Garrett (Viera and Garrett 2005) since it is more fine-grained than, e.g., the one suggested by Fleiss et al. (2003, p. 609). We say that the agreement is less than chance if κ ≤ 0, slight if 0.01 ≤ κ ≤ 0.20, fair if 0.21 ≤ κ ≤ 0.40, moderate if 0.41 ≤ κ ≤ 0.60, substantial if 0.61 ≤ κ ≤ 0.80 and almost perfect if 0.81 ≤ κ ≤ 1. To answer the first research question we look for the agreement between the tool and the manual labeling; to answer the second one—for agreement between two tools.

Table 1 Weighting scheme for the weighted kappa computation
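For illustration, a minimal sketch of the weighted kappa computation, assuming a linear-style weighting in the spirit of Table 1 (0 for agreement, 1 for a disagreement between neutral and positive/negative, 2 for a disagreement between positive and negative); the exact weights of Table 1 are not reproduced here:

```python
import numpy as np

CATEGORIES = ["negative", "neutral", "positive"]
WEIGHTS = np.array([[0, 1, 2],
                    [1, 0, 1],
                    [2, 1, 0]], dtype=float)  # assumed disagreement weights

def weighted_kappa(labels_a, labels_b):
    k = len(CATEGORIES)
    observed = np.zeros((k, k))
    for a, b in zip(labels_a, labels_b):
        observed[CATEGORIES.index(a), CATEGORIES.index(b)] += 1
    observed /= observed.sum()                                       # observed proportions
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))  # chance proportions
    return 1 - (WEIGHTS * observed).sum() / (WEIGHTS * expected).sum()
```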

ARI measures the correspondence between two partitions of the same data. Similarly to the Rand index (Rand 1971), ARI evaluates whether pairs of observations (comments) are considered as belonging to the same category (sentiment) rather than whether observations (comments) have been assigned to correct classes (sentiments). As opposed to the Rand index, ARI corrects for the possibility that pairs of observations have been put in the same category by chance. The expected value of ARI for independent partitions is 0. The maximal value, obtained, e.g., for identical partitions, is 1; the closer the value of ARI is to 1, the better the correspondence between the partitions. To answer the first research question we look for the correspondence between the partition of the comments into positive, neutral and negative groups provided by the tool and the partition based on the manual labeling. Similarly, to answer the second research question we look for the correspondence between the partitions of the comments into positive, neutral and negative groups provided by different tools.

Finally, F-measure, introduced by Lewis and Gale (1994) based on the earlier E-measure of Van Rijsbergen (1979, p. 128), is the harmonic mean of the precision and recall. Recall that precision in the classification context is the ratio of true positivesFootnote 8 and all entities predicted to be positive, while recall is the ratio of true positives and all entities known to be positive. The symmetry between precision and recall, false positives and false negatives, inherent in the F-measure makes it applicable both when addressing RQ1 and when addressing RQ2. We report the F-measure separately for the three classes: neutral, positive and negative.
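Both ARI and the per-class F-measures are readily available in scikit-learn; a minimal sketch (the labels shown are illustrative and not taken from the dataset):

```python
from sklearn.metrics import adjusted_rand_score, f1_score

manual = ["neutral", "positive", "negative", "neutral"]   # manual labeling
tool   = ["neutral", "neutral",  "negative", "positive"]  # output of a tool

ari = adjusted_rand_score(manual, tool)
f_per_class = f1_score(manual, tool,
                       labels=["negative", "neutral", "positive"], average=None)
```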

3.2 Results

None of the 265 consistently labeled comments produce SentiStrength results with p = −n and p ≥ 4. Three comments produce the “ERROR” status with Alchemy; we exclude those comments from consideration and report κ and ARI for the remaining 262 comments.

Results obtained both for RQ1 and for RQ2 are summarized in Table 2. Detailed confusion matrices relating the results of the tools and the manual labeling as well as results of different tools to each other are presented in Appendix A.

Table 2 Agreement of sentiment analysis tools with the manual labeling and with each other

3.3 Discussion

Our results clearly indicate that the sentiment analysis tools do not agree with the manual labeling and neither do they agree with each other.

RQ1

As can be observed from Table 2, both κ and ARI show that the tools are quite far from agreeing with the manual labeling: κ is merely fair, and ARI is low. NLTK scores best, followed by SentiStrength, and both perform better than Alchemy and Stanford NLP. Even when focusing solely on the positive and the negative sentiments, the F-measures suggest that improving the F-measure for the negative sentiments tends to decrease the F-measure for the positive ones, and vice versa.

RQ2

Values of κ and ARI obtained when comparing different tools with each other are even lower than those obtained for the agreement with the manual labeling. The highest value of κ, 0.25, has been obtained for Alchemy and Stanford NLP, and is only fair. Agreement between NLTK and SentiStrength, while also only fair, is the second highest one among the six possible pairs in Table 2.

To illustrate the reasons for the disagreement between the tools and the manual labeling as well as between the tools themselves we discuss a number of example comments.

Example 1

Our first example is a developer describing a clearly undesirable behavior (memory leak) in Apache UIMA. The leak, however, has been fixed; the developer confirms this and thanks the community.

To test this I used an aggregate AE with a CAS multiplier that declared getCasInstancesRequired()=5. If this AE is instantiated and run in a loop with earlier code it eats up roughly 10MB per iteration. No such leak with the latest code. Thanks!

Due to the presence of the expression of gratitude, the comment has been labeled as “love” by all four participants of the study by Murgia et al. We interpret this as a clear indication of positive sentiment. However, none of the tools is capable of recognizing this: SentiStrength labels the comment as being neutral, NLTK, Alchemy and Stanford NLP—as being negative. Indeed, for instance, Stanford NLP believes the first three sentences to be negative (e.g., due to the presence of “No”), and while it correctly recognizes the last sentence as positive, this is not enough to change the evaluation of the comment as a whole.

Example 2

The following comment from Apache Xerces merely describes an action that has taken place (“committed a patch”).

D.E. VeloperFootnote 9 committed your patch for Xerces 2.6.0. Please verify.

Three out of four annotators do not recognize the presence of emotion in this comment and we interpret this as the comment being neutral. However, keyword-based sentiment analysis tools might wrongly identify the presence of sentiment. For instance, in SentiWordNet (Baccianella et al. 2010) the verb “commit”, in addition to neutral meanings (e.g., perpetrate an act as in “commit a crime”), has several positive meanings (e.g., confer a trust upon, “I commit my soul to God”, or cause to be admitted when speaking of a person to an institution, “he was committed to prison”). In a similar way, the word “patch”, in addition to neutral meanings, has negative meanings (e.g., sewing that repairs a worn or torn hole, or a piece of soft material that covers and protects an injured part of the body). Hence, it should come as no surprise that some sentiment analysis tools identify this comment as positive, some others as negative and, finally, some as neutral.

These examples show that in order to be successfully applied in the software engineering context, sentiment analysis tools should become aware of the peculiarities of the software engineering domain: e.g., that the words “commit” and “patch” are merely technical terms and do not express sentiment. Our observation concurs with the challenge Novielli et al. (2015) have recognized in sentiment detection in social programming ecosystems such as Stack Overflow.

3.4 A Follow-up Study

Given the disagreement between different sentiment analysis tools, we wonder whether focusing only on the comments where the tools agree with each other would result in a better agreement with the manual labeling. Clearly, since the tools tend to disagree, such a focus reduces the number of comments that can be evaluated. However, it is a priori not clear whether a better agreement with the manual labeling can be expected. Thus, we have conducted a follow-up study: for every group of tools we consider only comments on which the tools agree, and determine κ, ARI and the F-measures with respect to the manual labeling.
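A minimal sketch of this filtering step (our own code, not part of the replication package):

```python
def agreeing_subset(tool_labels, manual_labels):
    """tool_labels: list of per-tool label lists; manual_labels: the golden-set labels."""
    kept_pred, kept_manual = [], []
    for i, manual in enumerate(manual_labels):
        predictions = {labels[i] for labels in tool_labels}
        if len(predictions) == 1:            # all tools in the group agree on this comment
            kept_pred.append(predictions.pop())
            kept_manual.append(manual)
    # the two lists can then be fed into the kappa, ARI and F-measure computations
    return kept_pred, kept_manual
```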

Results of the follow-up study are summarized in Table 3. As expected, the more tools we consider, the fewer comments remain. Recalling that 262 comments have been considered in our previous evaluation, only 52.6 % remain if agreement between two tools is required. For four tools slightly more than 20 % of the comments remain. We also see that focusing on the comments where the tools agree improves the agreement with the manual labeling both in terms of κ and in terms of ARI. The F-measures follow, in general, the same trend. This means that a trade-off should be sought between the number of comments the tools agree upon and the agreement with the manual labeling.

Table 3 Agreement of groups of tools with the manual labeling (n—the number of comments the tools agree upon)

3.5 Threats to Validity

As any empirical evaluation, the study presented in this section is subject to threats to validity:

  • Construct validity might have been threatened by our operationalization of sentiment polarity via emotion, recorded in the dataset by Murgia et al. (2014) (cf. the observations of Novielli et al. (2015)).

  • Internal validity of our evaluation might have been affected by the exact ways the tools have been applied and by the interpretation of the tools’ output as an indication of sentiment, e.g., the calculation of a document-level sentiment as −2·#0 − #1 + #3 + 2·#4 for Stanford NLP. Another threat to internal validity stems from the choice of the evaluation metrics: to reduce this threat we report several agreement metrics (ARI, weighted κ and F-measures) recommended in the literature.

  • External validity of this study can be threatened by the fact that only one dataset has been considered and by the way this dataset has been constructed and evaluated by Murgia et al. (2014). To encourage replication of our study and evaluation of its external validity we make publicly available both the source code and the data used to obtain the results of this paper.Footnote 10

3.6 Summary

We have observed that the sentiment analysis tools do not agree with the manual labeling (RQ1) and neither do they agree with each other (RQ2).

4 Impact of the Choice of Sentiment Analysis Tool

In Section 3 we have seen not only that the agreement of the sentiment analysis tools with the manual labeling is limited, but also that different tools do not necessarily agree with each other. However, this disagreement does not necessarily mean that conclusions based on the application of these tools in the software engineering domain are affected by the choice of the tool. Therefore, we now address RQ3 and discuss a simple set-up of a study aiming at understanding differences in response times for positive, neutral and negative texts.

4.1 Methodology

In the context of addressing RQ3, we study whether differences can be observed between response times (issue resolution times or question answering times) for positive, neutral and negative texts. We do not claim that the type of comment (positive, neutral or negative) is the main factor influencing response time: indeed, certain topics might be more popular than others and questions asked during the weekend might lead to higher resolution times. However, if different conclusions are derived for the same dataset when different sentiment analysis tools are used, then we can conclude that the disagreement between sentiment analysis tools affects the validity of conclusions in the software engineering domain.

Recent studies considering sentiment in software engineering data tend to include additional variables, e.g., sentiment analysis has been recently combined with politeness analysis (Danescu-Niculescu-Mizil et al. 2013) to study issue resolution time (Destefanis et al. 2016; Ortu et al. 2015). To illustrate the impact of the choice of sentiment analysis tool on the study outcome in presence of other analysis techniques, we repeat the response time study but combine sentiment analysis with politeness analysis.

4.1.1 Sentiment Analysis Tools

Based on the answers to RQ1 and RQ2 presented in Section 3.3 we select SentiStrength and NLTK to address RQ3. Indeed, NLTK scores best when compared to the manual labeling, followed by SentiStrength, and both perform better than Alchemy and Stanford NLP. Agreement between NLTK and SentiStrength, while also only fair, is still the second highest one among the six possible pairs in Table 2.

Moreover, we also repeat each study on the subset of texts where NLTK and SentiStrength agree. Indeed, Table 3 shows that these tools agree upon the largest subset of comments, achieving at the same time the highest κ, ARI and F-measures for the neutral and negative classes among the two-tool combinations. We also observe that further improvement of the evaluation metrics is possible, but at the cost of a significant drop in the number of comments.

4.1.2 Datasets

We study seven different datasets: titles of issues of the Android issue tracker, descriptions of issues of the Android issue tracker, titles of issues of the Apache Software Foundation (ASF) issue tracker, descriptions of issues of the ASF issue tracker, descriptions of issues of the Gnome issue tracker, titles of the Gnome-related Stack Overflow questions and bodies of the Gnome-related Stack Overflow questions. As opposed to the Android dataset, Gnome issues do not have titles. To ensure the validity of our study we have opted for five datasets collected independently by other researchers (Android Issue Tracker descriptions and titles, Gnome Issue Tracker descriptions, ASF Issue Tracker descriptions and titles) and two datasets derived by us from a well-known public data source (Gnome-related Stack Overflow question titles and bodies). All datasets are publicly available for replication purposes.Footnote 11 The descriptive statistics of the resolution/response times from these data-sets are given in Table 4.

Table 4 Descriptive statistics of resolution/response times

Android Issue Tracker

A dataset of 20,169 issues from the Android issue tracker was part of the mining challenge of MSR 2012 (Shihab et al. 2012). Excluding issues without a closing date, as well as those with bug_status “duplicate”, “spam” or “usererror”, results in a dataset with 5,216 issues.

We analyze the sentiment of the issue titles and descriptions. Five issues have an undetermined description sentiment. We remove these issues from further analysis on the titles and the descriptions. To measure the response time, we calculate the time difference in seconds between the opening (openedDate) and closing time (closedOn) of an issue.
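For illustration, a minimal sketch of this computation, assuming the issues are available as a CSV file with the openedDate and closedOn columns (the file name is hypothetical):

```python
import pandas as pd

issues = pd.read_csv("android_issues.csv", parse_dates=["openedDate", "closedOn"])
issues["response_time_s"] = (issues["closedOn"] - issues["openedDate"]).dt.total_seconds()
```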

Gnome Issue Tracker

The Gnome project issue tracker dataset containing 431,863 issues was part of the 2009 MSR mining challenge.Footnote 12 Similarly to the Android dataset, we have looked only at issues with the value of the bug_status field being resolved. In total, 367,877 issues have been resolved. We analyze the sentiment of the short descriptions of the issues (short_desc) and calculate the time difference in seconds between the creation and closure of each issue. Recall that, as opposed to the Android dataset, Gnome issues do not have titles.

Gnome-Related Stack Overflow Discussions

We use the StackExchange online data explorerFootnote 13 to obtain all Stack Overflow posts created before May 20, 2015, tagged gnome and having an accepted answer. For all 410 collected posts, we calculate the time difference in seconds between the creation of the post and the creation of the accepted answer. Before applying a sentiment analysis tool we remove HTML formatting from the titles and bodies of posts. In the results, we refer to the body of a post as its description.

ASF Issue Tracker

We use a dataset containing data from the ASF issue tracking system Jira. This dataset was collected by Ortu et al. (2015) and contains 701,002 issue reports. We analyze the sentiments of the titles and the descriptions of 95,667 issue reports that have a non-null resolved date, a resolved status and the resolution value being Fixed.

4.1.3 Politeness Analysis

Similarly to sentiment analysis classifying texts into positive, neutral and negative, politeness analysis classifies texts into polite, neutral and impolite. In our work we use the Stanford politeness APIFootnote 14 based on the work of Danescu-Niculescu-Mizil et al. (2013). As opposed to sentiment analysis tools such as SentiStrength and NLTK, the Stanford politeness API has been evaluated on software engineering data: Stack Overflow questions and answers.

Given a textual fragment the Stanford politeness API returns a politeness score ranging between 0 (impolite) and 1 (polite) with 0.5 representing the “ideal neutrality”. To discretize the score into polite, neutral and impolite we apply the Stanford politeness API to the seven datasets above. It turns out that the politeness scores of the majority of comments are low: the median score is 0.314, the mean score is 0.361 and the third quartile (Q3) is 0.389. We use the latter value to determine the neutrality range. We say therefore that the comments scoring between 0.389 and 0.611 = 1 − 0.389 are neutral; comments scoring lower than 0.389 are impolite and comments scoring higher than 0.611 are polite.
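A minimal sketch of this discretization (the helper name is ours):

```python
Q3 = 0.389  # third quartile of the politeness scores across the seven datasets

def discretize_politeness(score):
    """Map a Stanford politeness API score in [0, 1] to a politeness class."""
    if score < Q3:
        return "impolite"
    if score > 1 - Q3:  # 0.611
        return "polite"
    return "neutral"
```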

4.1.4 Statistical Analysis

To answer our research questions we need to compare distributions of response times corresponding to different groups of issues. We conduct two series of studies. In the first series of studies we compare the distributions of the response times corresponding to positive, neutral and negative questions/issues. In the second series we also consider politeness and compare the distributions of the response times corresponding to nine groups obtained through all possible combinations of sentiment (positive, neutral and negative) and politeness (polite, neutral and impolite).

Traditionally, a comparison of multiple groups follows a two-step approach: first, a global null hypothesis is tested, then multiple comparisons are used to test sub-hypotheses pertaining to each pair of groups. The first step is commonly carried out by means of ANOVA or its non-parametric counterpart, the Kruskal-Wallis one-way analysis of variance by ranks. The second step uses the t-test or the rank-based Wilcoxon-Mann-Whitney test (Wilcoxon 1945), with correction for multiple comparisons, e.g., Bonferroni correction (Dunn 1961; Sheskin 2007). Unfortunately, the global test null hypothesis may be rejected while none of the sub-hypotheses are rejected, or vice versa (Gabriel 1969). Moreover, simulation studies suggest that the Wilcoxon-Mann-Whitney test is not robust to unequal population variances, especially in the case of unequal sample sizes (Brunner and Munzel 2000; Zimmerman and Zumbo 1992). Therefore, one-step approaches are preferred: these should produce confidence intervals which always lead to the same test decisions as the multiple comparisons. We use the \(\widetilde {\mathbf {T}}\)-procedure (Konietschke et al. 2012) for Tukey-type contrasts (Tukey 1951), the probit transformation and the traditional 5 % family error rate (cf. Vasilescu et al. 2013; Wang et al. 2014).

The results of the \(\widetilde {\mathbf {T}}\)-procedure are a series of probability estimates p(a, b) with the corresponding p-values, where a and b represent the distributions being compared. The probability estimate p(a, b) is interpreted as follows: if the corresponding p-value exceeds 5 % then no evidence has been found for a difference in response times corresponding to categories a and b. If, however, the corresponding p-value does not exceed 5 % and p(a, b) > 0.5, then response times in category b tend to be larger than those in category a. Finally, if the corresponding p-value does not exceed 5 % and p(a, b) < 0.5, then response times in category a tend to be larger than those in category b.

We opt for comparison of distributions rather than a more elaborate statistical modeling (cf. Ortu et al. 2015) since it allows for an easy comparison of the results obtained for different tools.
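For illustration only, a minimal sketch of the traditional two-step approach described above (Kruskal-Wallis followed by pairwise Wilcoxon-Mann-Whitney tests with a Bonferroni correction); the one-step \(\widetilde {\mathbf {T}}\)-procedure that we actually use is not sketched here:

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def two_step_comparison(groups):
    """groups: dict mapping a category name to a list of response times."""
    _, global_p = kruskal(*groups.values())           # step 1: global null hypothesis
    pairs = list(combinations(groups, 2))
    corrected = {}
    for a, b in pairs:                                # step 2: pairwise comparisons
        _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        corrected[(a, b)] = min(p * len(pairs), 1.0)  # Bonferroni-corrected p-value
    return global_p, corrected
```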

4.1.5 Agreement Between the Results

Recall that sentiment analysis tools induce a partition of the response times into categories. For every pair of categories (a, b) the \(\widetilde {\mathbf {T}}\)-procedure indicates one of the three following outcomes: > (response times in category a tend to be larger than those in category b), < (response times in category b tend to be larger than those in category a) or ∥ (no evidence has been found for a difference in response times corresponding to categories a and b). We stress that we refrain from interpreting lack of evidence for difference as evidence for lack of difference, i.e., we do not claim that the distributions of response times corresponding to categories a and b are the same but merely that we cannot find evidence that these distributions are not the same. Hence, we also use ∥ (incomparable) rather than = (equal).

To compare the tools we therefore need to assess the agreement between the results produced by the \(\widetilde {\mathbf {T}}\)-procedure for partitions induced by different tools.

Example 3

Let the \(\widetilde {\mathbf {T}}\)-procedure report “pos < neu”, “pos < neg” and “neu < neg” for partitions induced by Tool1, “pos < neu”, “pos < neg” and “neu ∥ neg” for partitions induced by Tool2, and “pos > neu”, “pos > neg” and “neu ∥ neg” for partitions induced by Tool3. Then, we would like to say that Tool1 agrees more with Tool2 than with Tool3, and Tool2 agrees more with Tool3 than with Tool1.

Unfortunately, traditional agreement measures such as those discussed in Section 3.1.2 are no longer applicable since the number of datapoints (pairs of categories) is small: 3 for sentiment and 36 for the sentiment-politeness combination. Hence, we propose to count the pairs of categories (a, b) such that the \(\widetilde {\mathbf {T}}\)-procedure produces the same result for partitions induced by both tools (so-called observed agreement).

Example 4

For Example 3 we observe that Tool1 and Tool2 agree on two pairs, Tool1 and Tool3 agree on zero pairs, and Tool2 and Tool3 agree on one pair.

We believe, however, that a disagreement between claims “response times in category a tends to be larger than those in category b” and “response times in category b tends to be larger than those in category a” is more severe than between claims “response times in category a tends to be larger than those in category b” and “no evidence has been found for difference in response times corresponding to categories a and b”. One possible way to address this concern would be to associate different kinds of disagreement with different weights: this is an approach taken, e.g., by the weighted κ (Cohen 1968). However, the choice of specific weights might appear arbitrary.

Hence, when reporting disagreement between the tools (cf. Tables 6 and 8 below) we report different kinds of disagreement separately, i.e., we report four numbers x − y − z − w, where

  • x is the number of pairs for which the tools agree about the relation between the response times (>> or <<),

  • y is the number of pairs for which the tools agree about the lack of such a relation (∥∥),

  • z is the number of pairs when one of the tools has established the relation and another one did not (∥>, ∥<, <∥ or >∥),

  • w is the number of pairs when the tools have established different relations (<> or ><).

Example 5

Example 3, continued. We report agreement between Tool1 and Tool2 as 2 − 0 − 1 − 0, between Tool1 and Tool3 as 0 − 0 − 1 − 2, and between Tool2 and Tool3 as 0 − 1 − 0 − 2.
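A minimal sketch of this counting (our own code), reproducing the comparison of Tool1 and Tool2 from Example 3:

```python
def count_agreement(results_a, results_b):
    """results: dict mapping a category pair to '>', '<' or '||'."""
    x = y = z = w = 0
    for pair in results_a:
        ra, rb = results_a[pair], results_b[pair]
        if ra == rb:
            if ra == "||":
                y += 1   # both tools found no evidence of a difference
            else:
                x += 1   # both tools found the same relation
        elif "||" in (ra, rb):
            z += 1       # only one of the tools found a relation
        else:
            w += 1       # the tools found opposite relations
    return x, y, z, w

tool1 = {("pos", "neu"): "<", ("pos", "neg"): "<", ("neu", "neg"): "<"}
tool2 = {("pos", "neu"): "<", ("pos", "neg"): "<", ("neu", "neg"): "||"}
print(count_agreement(tool1, tool2))  # (2, 0, 1, 0), as in Example 5
```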

4.2 Results

Results of our study are summarized in Table 5. For the sake of readability the relations found are aligned horizontally. For each dataset and each tool we also report the number of issues/questions recognized as negative, neutral or positive.

Table 5 Comparison of NLTK and SentiStrength. Thresholds for statistical significance: 0.05 (∗), 0.01 (∗∗), 0.001 (∗∗∗). Exact p-values are indicated as subscripts; 0 indicates that the p-value is too small to be computed precisely. For the sake of readability we omit pairs for which no evidence has been found for differences in response times

We observe that NLTK and SentiStrength agree only on one relation for the Android data, i.e., that issues with a neutral sentiment tend to be resolved more slowly than issues formulated in a more positive way. We also observe that for Gnome and ASF the tools agree that the issues with a neutral sentiment are resolved faster than issues with a positive sentiment, i.e., the results for Gnome and ASF are opposite to those for Android.

Further inspection reveals that differences between NLTK and SentiStrength led to the relations “neu > neg” and “neg > pos” being discovered in Android issue descriptions by only one of the tools and not by the other. In the same way, “pos > neg” on the ASF descriptions data can be found only by SentiStrength. It is also surprising that while “pos > neg” has been found for the ASF titles data both by NLTK and by SentiStrength, it cannot be found when one restricts the attention to the issues where the tools agree. Finally, contradictory results have been obtained for Gnome issue descriptions: while the NLTK-based analysis suggests that the positive issues are resolved more slowly than the negative ones, the SentiStrength-based analysis suggests the opposite.

Overall, the agreement between NLTK, SentiStrength and NLTKSentiStrength reported as described in Section 4.1.5 is summarized in Table 6.

Table 6 Agreement between NLTK, SentiStrength and NLTK \(\cap \) SentiStrength. See Section 4.1.5 for the explanation of the x − y − z − w notation

Next we perform a similar study by including the politeness information. Table 7 summarizes the findings for Android. Observe that not a single relation could have been established both by NLTK and by SentiStrength. Results for Gnome, Stack Overflow and ASF are presented in Tables 18, 19 and 20 in the appendix. Agreement is summarized in Table 8: including politeness increases the number of categories to be compared to nine, and therefore, the number of possible category pairs to \(\frac {9*8}{2} = 36\). Table 8 suggests that while the tools tend to agree on the relation or lack thereof between most of the category pairs, the differences between the tools account for the differences in the relations observed in up to 30 % (11/36) of the pairs. Still, while differences between the tools leading to contradictory results are relatively rare (two cases in Gnome, one in ASF titles and one in ASF descriptions), the differences tend to manifest as a relation being discovered when only one of the tools is used.

Table 7 Comparison of NLTK and SentiStrength in combination with politeness for the Android datasets. Thresholds for statistical significance: 0.05 (∗), 0.01 (∗∗), 0.001 (∗∗∗). Exact p-values are indicated as subscripts. Results for Gnome, Stack Overflow and ASF are presented in Tables 18, 19 and 20 in the appendix
Table 8 Agreement between NLTK, SentiStrength and NLTK \(\cap \) SentiStrength (politeness information included). See Section 4.1.5 for the explanation of the x − y − z − w notation

4.3 Discussion

Our results suggest the choice of the sentiment analysis tool affects the conclusions one might derive when analysing differences in the response times, casting doubt on the validity of those conclusions. We conjecture that the same might be observed for any kind of software engineering studies dependent on off-the-shelf sentiment analysis tools. A more careful sentiment analysis for software engineering texts is therefore needed: e.g., one might consider training more general purpose machine learning tools such as Weka (Hall et al. 2009) or RapidMinerFootnote 15 on software engineering data.

A similar approach has recently been taken by Panichella et al. (2015), who have used Weka to train a Naive Bayes classifier on 2090 App Store and Google Play review sentences. Indeed, both the dependency of sentiment analysis tools on the domain (Gamon et al. 2005) and the need for text-analysis tools specifically targeting texts related to software engineering (Howard et al. 2013) have been recognized in the past.

4.4 Threats to Validity

Validity of the conclusions derived might have been threatened by the choice of the data as well by the choice of the statistical machinery.

To reduce the threats related to the data, we have opted for seven different but similar datasets: the Stack Overflow dataset contains information about questions and answers; Android, Gnome and ASF—information about issues. We expect the conclusions above to be valid at least for other issue trackers and software engineering question & answer platforms. For Android, Gnome and ASF we have reused data collected by other researchers (Shihab et al. (2012), BirdFootnote 16 and Ortu et al. (2015), respectively). We believe the threats associated with noise in these datasets are limited as they have been extensively used in previous studies: e.g., Asaduzzaman et al. and Martie et al. used the Android dataset, Linstead and Baldi (2009) used the Gnome dataset, and Ortu et al. (2015) used the ASF dataset. The only dataset we have collected ourselves is the Stack Overflow dataset, and indeed the usual threats related to the completeness of the data (questions can be removed) apply. Furthermore, the presence of machine-generated text, e.g., error messages, stack traces or source code, might have affected our results.

Similarly, to reduce the threats related to the choice of the statistical machinery we opt for the \(\widetilde {\mathbf {T}}\)-approach (Konietschke et al. 2012) that has been successfully applied in the software engineering context (Dajsuren et al. 2013; Li et al. 2014; Sun et al. 2015; Vasilescu et al. 2013; Vasilescu et al. 2013; Wang et al. 2014; Yu et al. 2016).

5 Implications on Earlier Studies

In this section we consider RQ4: while the preceding discussion indicates that the choice of a sentiment analysis tool might affect the validity of software engineering results, we now investigate whether this is indeed the case by performing replication studies (Shull et al. 2008) for two published examples. Since our goal is to understand whether the effects observed in the earlier studies hold when a different sentiment analysis tool is used, we opt for dependent or similar replications (Shull et al. 2008). In dependent replications the researchers aim at keeping the experiment the same or very similar to the original one, possibly changing the artifact being studied.

5.1 Replicated Studies

We have chosen to replicate two previous studies conducted as part of the 2014 MSR mining challenge: both studies use the same dataset of 90 GitHub projects (Gousios 2013). The dataset includes information from the top-10 starred repositories in the most popular programming languages and is not representative of GitHub as a wholeFootnote 17.

The first paper we have chosen to replicate is the one by Pletea et al. (2014). In this paper the authors apply NLTK to GitHub comments and discussions, and conclude that security-related discussions on GitHub contain more negative emotions than non-security related discussions. Taking the blame, the fourth author of the current manuscript has also co-authored the work by Pletea et al. (2014).

The second paper we have chosen to replicate is the one by Guzman et al. (2014). The authors apply SentiStrength to analyze the sentiment of GitHub commit comments and conclude that comments written on Mondays tend to contain a more negative sentiment than comments written on other days. This study was the winner of the MSR 2014 challenge.

5.2 Replication Approach

We aim at performing an exact replication of the chosen studies with one notable deviation from the original work: we apply a different sentiment analysis tool to each study. Since the original study of Pletea et al. uses NLTK, we intend to apply SentiStrength in the replication; since Guzman et al. use SentiStrength, we intend to apply NLTK. However, since the exact collections of comments used in the original studies were no longer available, we had to recreate the datasets ourselves. This led to minor differences in the number of comments we have found as opposed to the numbers reported in the original studies. Hence, we replicate each study twice: first applying the same tool as in the original study to slightly different data, second applying a different sentiment analysis tool to the same data as in the first replication.

We hypothesize that the differences between applying the same tool to slightly different datasets would be small. However, we expect that we might get different, statistically significant, results in these studies when using a different sentiment analysis tool.

5.2.1 Pletea et al.

Pletea et al. distinguish between comments and discussions, i.e., collections of comments pertaining to an individual commit or pull request. Furthermore, the authors distinguish between security-related and non-security related comments/discussions, resulting in eight different categories of texts. The original study has found that for commit comments, commit discussions, pull request comments and pull request discussions, the negativity for security-related texts is higher than for other texts. Results of comparing the sentiment recognition of a sentiment analysis tool (NLTK) with 30 manually labeled security-related commit discussions were mixed. Moreover, it has been observed that the NLTK results were mostly bipolar, having both strong negative and strong positive components. Based on these observations the authors suggest that security-related discussions are more emotional than non-security related ones.

In our replication of this study we present a summary of the distribution of the sentiments for commits and pull requests, recreating Tables 2 and 3 from the original study. In order to do this, we also need to distinguish security-related texts and other texts, i.e., we replicate Table 1 from the paper. We extend the original comparison with the manually labeled discussions by including the results obtained by SentiStrength.

5.2.2 Guzman et al.

In this study, the authors have focused on commit comments and studied differences between the sentiments of commit comments written on different days of the week and at different times of day, belonging to projects in different programming languages, created by teams distributed over different continents and “starred”, i.e., approved, by different numbers of GitHub users.

We replicate the studies pertaining to differences between comments based on day and time of their creation and programming language of the project. We do not replicate the study related to the geographic distribution of the authors because the mapping of developers to continents has been manually made by Guzman et al. and was not present in the original dataset.

5.3 Replication Results

Here we present the results of replicating both studies.

5.3.1 Pletea et al.

We start the replication by creating Table 9, which corresponds to Table 1 from the paper by Pletea et al. We have rerun the division using the keyword list as included in the original paper. As explained above, we have found slightly different numbers of comments and discussions in each category. Most notably, we find 180 fewer security-related commit comments. However, the percentages of security and non-security related comments and discussions are similar.

Table 9 Identification of security-related comments and discussions results

To ensure validity of the comparison between NLTK and SentiStrength we have applied both tools to comments and discussions. On several occasions the tools reported an error. We have decided to exclude those cases to ensure that further analysis applies to exactly the same comments and discussions. Hence, in Table 9 we also report the numbers of comments and discussions excluded.

Next we apply NLTK and SentiStrength to analyze the sentiment of comments and discussions. Tables 10 and 11 present the results of Tables 2 and 3 of the original paper, respectively, and extend them by including the results of NLTK and SentiStrength on the current study dataset from Table 9. Inspecting Tables 10 and 11 we observe that the values obtained when using NLTK are close to those reported by Pletea et al., while SentiStrength produces very different results. Indeed, NLTK indicates that comments and discussions, submitted via commits or via pull requests, are predominantly negative, while according to SentiStrength neutral is the predominant classification.

Table 10 Commits sentiment analysis statistics. The largest group per study is typeset in boldface
Table 11 Pull Requests sentiment analysis statistics. The largest group per study is typeset in boldface

Despite those differences, the original conclusion of Pletea et al. still holds: whether we consider comments or discussions, commits or pull requests, the percentage of negative texts among security-related texts is higher than among non-security related texts.

Finally, in Table 4 Pletea et al. consider thirty security-related commit discussions and compare the evaluation of security relevance and sentiment as determined by the tools with the decisions of a human evaluator. The discussions have been selected based on the number of security keywords found: ten discussions labeled as “high” have been randomly selected from the top 10 % of discussions with the highest number of security keywords found, “middle” from the middle 10 % and “low” from the bottom 10 % of all security-related discussions.

Table 12 extends Table 4 (Pletea et al. 2014) by adding a column with the results of SentiStrength. Asterisks indicate the strength of the sentiment as perceived by the human evaluator.

Table 12 Case study results. Strength of the human-labeled sentiments has been labeled by Pletea et al. on a 5-star scale (Pletea et al. 2014)

By inspecting Table 12 we observe that NLTK agrees with the human evaluator in 14 cases out of 30; SentiStrength—in 13 cases out of 30, but the tools agree with each other only in 9 cases. We can therefore conclude that replacing NLTK by SentiStrength did not affect this conclusion of the original study: the results of the agreement with the manual labeling are still mixed.

We also observe that both for NLTK and for SentiStrength agreement in the “high” security group is lower than in the “low” security group.

Moreover, Pletea et al. have observed that the NLTK results were mostly bipolar, having both strong negative and strong positive components, suggesting that security-related discussions are more emotional. This observation is not supported by SentiStrength, which classifies 17 out of 30 discussions as neutral.

5.3.2 Guzman et al.

We classified all 60658 commit comments in the MSR 2014 challenge dataset (Gousios 2013) using NLTK.

In the original paper by Guzman et al. (2014) the authors claim, on the one hand, to have analyzed 60425 commit comments and, on the other hand, to have focused on comments of all projects having more than 200 comments. However, when replicating this study and considering comments of projects having more than 200 comments we have obtained merely 50133 comments, more than ten thousand comments fewer than in the original study. Therefore, to be as close as possible to the original study we have decided to include all commit comments in the dataset, which produced 233 comments more than in the original study.

Guzman et al. start by considering the six projects with the highest number of commit comments: Jquery, Rails, CraftBukkit, Diaspora, MaNGOS and TrinityCore. The authors present two charts to show the average sentiment score in those six projects and the proportions of negative, neutral and positive sentiments in commit comments. We replicate their study twice: first using the same tool used by the authors (SentiStrength), and then using an alternative tool (NLTK).

Figs. 2 and 3 show the replication of the study of the average sentiment score in the six projects. The original figure from the work of Guzman et al. is shown in Fig. 1. Comparing Fig. 1 with Fig. 2 we observe that while the exact values of the averages are lower in the replication, the relative order of the projects is almost the same. Indeed, Rails is the most positive project, followed by MaNGOS and then the close values of Diaspora and TrinityCore, followed by Jquery and at last CraftBukkit. Differences between Figs. 1 and 3 are more pronounced. Indeed, the average emotion score is more negative than in the original study for each project. Moreover, while Jquery and CraftBukkit are still the most negative projects, Rails is no longer positive or even least negative.

Fig. 1 Emotion score average per project, using SentiStrength (Guzman et al. 2014)

Fig. 2 Emotion score average per project, using SentiStrength (replication)

Fig. 3 Emotion score average per project, using NLTK (replication)

Next we consider the proportions of negative, neutral and positive sentiments. The original figure from the work of Guzman et al. is shown in Fig. 4, while Figs. 5 and 6 show the results of our replications. The NLTK replication (Fig. 6) shows a larger proportion of negative commit comments than the original paper (Fig. 4), which in turn shows a larger proportion of negative commit comments than the SentiStrength replication (Fig. 5).

Fig. 4 Proportion of positive, neutral and negative commit comments per project, using SentiStrength (Guzman et al. 2014)

Fig. 5 Proportion of positive, neutral and negative commit comments per project, using SentiStrength (replication)

Fig. 6 Proportion of positive, neutral and negative commit comments per project, using NLTK (replication)

Tables 13–15 contain the results of replicating the remaining analyses of the study by Guzman et al. As above, we replicate those analyses twice: first using the same tool used by the authors (SentiStrength), and then using an alternative tool (NLTK).

In contrast to SentiStrength, NLTK outputs scores between 0 and 1 for negative, neutral and positive to indicate the probability of each sentiment. In the original paper, the SentiStrength scores are mapped to an integer in the range [−5,−1) for negative texts, 0 for neutral texts and in the range (1,5] for positive texts. In addition, negative scores were multiplied by 1.5 to account for the less frequent occurrence of negativity in human texts. Therefore, when using NLTK we apply a transformation to create numbers in the same range according to the following formula:

$$\textit{sentiment\_score} = \left\{\begin{array}{ll} (((\textit{neg} - 0.5)\cdot(-6)) - 2)\cdot 1.5 & \text{if negative}\\ 0 & \text{if neutral} \\ ((\textit{pos} - 0.5)\cdot 6) + 2 & \text{if positive} \end{array}\right. $$

The formula maps numbers from the range given by NLTK to the range used by SentiStrength and multiplies negative scores by 1.5, as done in the study by Guzman et al.
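A minimal sketch of this mapping (the helper name is ours; neg and pos denote the NLTK probabilities and label the class chosen as described in Section 2.2.3):

```python
def nltk_to_sentiment_score(label, neg, pos):
    if label == "negative":
        return (((neg - 0.5) * (-6)) - 2) * 1.5  # negative scores additionally weighted by 1.5
    if label == "positive":
        return ((pos - 0.5) * 6) + 2
    return 0.0                                   # neutral
```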

We stress that we do not compare the sentiment values obtained using NLTK with those obtained using SentiStrength. Rather, we compare sentiment values obtained for different groups of comments using the same tool and the same data set, and then observe (dis)agreement between the conclusions made. In Tables 13–15 we replicate the sentiment scores grouped by programming language, weekday and time of the day. The original study reports the mean and the standard deviation. However, the mean can be unreliable (Vasilescu et al. 2011) and, therefore, we also report the median and the interquartile range IQR = Q3 − Q1.

Guzman et al. report that “Java projects tend to have a slightly more negative score than projects implemented in other languages”. As can be seen from Table 13, when the same tool (SentiStrength) has been applied to our data set, a similar conclusion can be made. This is, however, not the case when NLTK has been applied: Table 13 shows a lower average emotion score for the C programming language than for Java. Also the median score for C is lower than for Java. We can therefore say that the validity of this conclusion is not affected by the data set but is affected by the choice of the sentiment analysis tool.

Table 13 Emotion score average grouped by programming language

Furthermore, Guzman et al. report that the observation about Java has been statistically confirmed and that the statistical tests on the remaining programming languages (C, C++, JavaScript, PHP, Python and Ruby) did not yield significant results. The statistical test used is the Wilcoxon rank sum test. The authors compare seven programming languages and report that the corresponding p-values are less than or equal to 0.002. We conjecture that the Bonferroni correction for multiple comparisons has been applied since 0.05/21 ≃ 0.0024.

When replicating this study we first exclude projects developed in languages other than the seven considered in the original study, keeping 55,405 commit comments. Next we compare the distributions corresponding to the different programming languages. A more statistically sound procedure would have been the \(\widetilde {\mathbf {T}}\)-procedure discussed in Section 4.1.4. However, in order to keep our replication as close as possible to the original study, we also perform a series of pairwise Wilcoxon tests with the Bonferroni correction.
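A minimal sketch of such a procedure is given below, assuming the per-language sentiment scores are available in a hypothetical dictionary scores_by_language; we use SciPy's mannwhitneyu, which implements the Wilcoxon rank-sum (Mann–Whitney U) test:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_rank_sum_tests(scores_by_language, alpha=0.05):
    """Pairwise Wilcoxon rank-sum tests with a Bonferroni-corrected threshold.

    `scores_by_language` is assumed to map a language name to the list of
    sentiment scores of its commit comments (hypothetical structure).
    """
    pairs = list(combinations(sorted(scores_by_language), 2))
    threshold = alpha / len(pairs)  # 0.05 / 21 ~ 0.0024 for seven languages
    results = {}
    for lang_a, lang_b in pairs:
        _, p_value = mannwhitneyu(scores_by_language[lang_a],
                                  scores_by_language[lang_b],
                                  alternative='two-sided')
        results[(lang_a, lang_b)] = (p_value, p_value <= threshold)
    return results
```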

In the replication with SentiStrength we observe that (1) the claim that Java has a more negative score than the other languages is not confirmed (the p-value for the (Java, C) pair is 0.6552) and (2) the reported lack of statistically significant differences between the other programming languages is not confirmed either (e.g., the p-value for (C, C++) with the two-sided alternative is 6.9 × 10−12). Similarly, in the replication with NLTK neither of the claims of the original study can be confirmed.

Consider next the study of the sentiments grouped by weekday. Guzman et al. report that comments on Monday were more negative than comments on the other days. Similarly to the study of programming languages, Table 14 suggests that a similar conclusion can be derived if SentiStrength is used, but this is no longer the case for NLTK. In fact, the mean NLTK score for Monday is the least negative. The median values both for SentiStrength and for NLTK are 0 for all days, suggesting that no difference can be found. Guzman et al. then performed a statistical analysis and compared Monday against each of the other days. This analysis “confirmed that commit comments were more negative on Monday than on Sunday, Tuesday, and Wednesday (p-value ≤0.015)”. We replicated this study with SentiStrength and observed that p ≤ 0.015 for Tuesday, Friday and Saturday. We can conclude that, while the exact days have not been confirmed, we can still say that commit comments on Monday are more negative than those on some other days. Unfortunately, even this weaker conclusion cannot be confirmed if NLTK is used: p exceeds 0.015 for all days (in fact, p ≥ 0.72 for all days).
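The corresponding one-sided comparisons can be sketched analogously to the pairwise tests above, again assuming a hypothetical dictionary scores_by_day mapping a weekday name to its sentiment scores:

```python
from scipy.stats import mannwhitneyu

OTHER_DAYS = ['Sunday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

def monday_vs_other_days(scores_by_day):
    """One-sided rank-sum tests: is Monday shifted towards more negative scores?

    `scores_by_day` is assumed to map weekday names to lists of sentiment
    scores (hypothetical structure); alternative='less' tests whether
    Monday's scores tend to be lower, i.e. more negative, than those of
    the compared day.
    """
    monday = scores_by_day['Monday']
    return {day: mannwhitneyu(monday, scores_by_day[day],
                              alternative='less').pvalue
            for day in OTHER_DAYS}
```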

Table 14 Emotion score average grouped by weekday

Finally, Table 15 shows that NLTK evaluates the comments made in the afternoon as slightly more negative than the comments made in the evening, in contrast to SentiStrength, which indicates the afternoon comments as the most positive, or at least the least negative, ones. We could not replicate the original time-of-day results with either SentiStrength or NLTK.

Table 15 Emotion score average grouped by time of the day

5.4 Discussion

When replicating the study of Pletea et al. we confirm the original observation that security comments or discussions are more often negative than non-security ones. We also observe that, when compared with the manually labeled security discussions, both tools produce mixed results. However, we could not find evidence supporting the suggestion that security-related discussions are more emotional.

When trying to replicate the results of Guzman et al. we could not derive the same conclusions when a different tool was used. The only conclusion we could replicate when the same tool was used is that the commit comments on Monday are more negative than those on some other days, which is a weakened form of the original claim. Recently, Islam and Zibran (2016) performed a similar study of the differences between emotions expressed by developers during different times and days of the week. Similarly to Guzman et al., Islam and Zibran studied commit messages and used SentiStrength; as opposed to Guzman et al., they considered the 50 projects with the highest number of commits from the Boa dataset (Dyer et al. 2013) rather than the 2014 MSR mining challenge dataset of 90 GitHub projects (Gousios 2013). In sharp contrast with the work of Guzman et al., no significant differences were found in the developers’ emotions across different times and days of the week.

Our replication studies show that the validity of the conclusions of previously published papers such as the ones by Pletea et al. (2014) and Guzman et al. (2014) should be questioned and ideally reassessed when (or if) a sentiment analysis tool specifically targeting the software engineering domain becomes available.

5.5 Threats to Validity

As with any empirical study, the current replications are subject to threats to validity. Since we have tried to follow the methodology presented in the papers being replicated as closely as possible, we have also inherited some of the threats to validity of those papers, e.g., that the dataset under consideration is not representative of GitHub as a whole. Furthermore, we had to convert the NLTK scores to the [−5,5] scale, and this conversion might have introduced additional threats to validity. Finally, we are aware that the pairwise Wilcoxon tests performed in Section 5.3.2 might not be the preferred approach from the statistical point of view: this is why a more advanced statistical technique has been used in Section 4. However, to support the comparative aspects of the replication in Section 5.3.2, we present the results in exactly the same way as in the original work (Guzman et al. 2014).

6 Related Work

This paper builds on our previous work (Jongeling et al. 2015). The current submission extends it by reporting on a follow-up study (Section 3.3) and the replication of two recent studies (Section 5), as well as presenting a more elaborate discussion of the related work below.

6.1 Sentiment Analysis in Large Text Corpora

As announced in the Manifesto for Agile Software Development (Beck et al. 2001), the centrality of developer interaction in large scale software development has come to be increasingly recognized in recent times (Datta et al. 2012; Schröter et al. 2012). Today, software development is influenced in myriad ways by how developers talk, and what they talk about. With distributed teams developing and maintaining many software systems today (Cataldo and Herbsleb 2008), developer interaction is facilitated by collaborative development environments that capture details of discussion around development activities (Costa et al. 2011). Mining such data offers an interesting opportunity to examine implications of the sentiments reflected in developer comments.

Since its inception, sentiment analysis has become a popular approach towards classifying text documents by the predominant sentiment expressed in them (Pang et al. 2002). As people increasingly express themselves freely in online media such as the microblogging site Twitter, or in product reviews on Web marketplaces such as Amazon, rich corpora of text are available for sentiment analysis. Davidov et al. have suggested a semi-supervised approach for recognizing sarcastic sentences in Twitter and Amazon (Davidov et al. 2010). As sentiments are inherently nuanced, a major challenge in sentiment analysis is to discern the contextual meaning of words. Pak and Paroubek suggest an automated and language-independent method for disambiguating adjectives in Twitter data (Pak and Paroubek 2010), and Agarwal et al. have proposed an approach to correctly identify the polarity of tweets (Agarwal et al. 2011). Mohammad, Kiritchenko, and Zhu report the utility of support vector machine (SVM) based classifiers when analyzing sentiments in tweets (Mohammad et al. 2013). Online question and answer forums such as Yahoo! Answers are also helpful sources of data for sentiment mining (Kucuktunc et al. 2012).

6.2 Sentiment Analysis Application in Software Engineering

The burgeoning field of tools, methodologies, and results around sentiment analysis has also impacted how we examine developer discussion. Goul et al. examine how requirements can be extracted from sentiment analysis of app store reviews (Goul et al. 2012). The authors conclude that while sentiment analysis can facilitate requirements engineering, in some cases algorithmic analysis of reviews can be problematic (Goul et al. 2012). User reviews of a software system in operation can offer insights into the quality of the system. However, given the unstructured nature of review comments, it is often hard to reach a clear understanding of how well a system is functioning. A key challenge comes from “... different sentiment of the same sentence in different environment”. To work around this problem, Leopairote et al. propose a methodology that combines lists of positive and negative sentiment words with rule-based classification (Leopairote et al. 2013). Mailing lists often characterize large, open source software systems, as different stakeholders discuss their expectations as well as disappointments with the system. Analyzing the sentiment of such discussions can be an important step towards a deeper understanding of the corresponding ecosystem. Tourani et al. seek to identify distress or happiness in a development team by analyzing sentiments in Apache mailing lists (Tourani et al. 2014). The study concludes that developer and user mailing lists carry similar sentiments, though differently focused, and that automatic sentiment analysis techniques need to be tuned specifically to the software engineering context (Novielli et al. 2015). The impact of sentiment on issue resolution time, similar to RQ3 discussed in Section 4, has also been considered in the literature (Garcia et al. 2013; Ortu et al. 2015).

As mentioned earlier, developer interaction data captured by collaborative development environments are fertile grounds for analyzing sentiments. There are recent trends around designing emotion aware environments that employ sentiment analysis and other techniques to discern and visualize health of a development team in real time (Vivian et al. 2015). Latest studies have also explored the symbiotic relationship between collaborative software engineering and different kinds of task based emotions (Dewan 2015).

6.3 Sentiment Analysis Tools

As already mentioned in the introduction, the application of sentiment analysis tools to software engineering texts has been studied in a series of recent publications (Garcia et al. 2013; Guzman et al. 2014; Guzman and Bruegge 2013; Novielli et al. 2015; Ortu et al. 2015; Panichella et al. 2015; Pletea et al. 2014; Rousinopoulos et al. 2014).

With the notable exception of the work of Panichella et al. (2015), who trained their own classifier on manually labeled software engineering data, all other works have reused existing sentiment analysis tools. Such reuse introduced a commonly recognized threat to the validity of the results obtained: those tools have been trained on non-software engineering related texts such as movie reviews or product reviews and might misidentify (or fail to identify) the polarity of a sentiment in a software engineering artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).

In our previous work (Jongeling et al. 2015) and in the current submission we perform a series of quantitative analyses aiming to evaluate whether the choice of the sentiment analysis tool can affect the validity of software engineering results. A complementary approach to evaluating the applicability of sentiment analysis tools to software engineering data has been followed by Novielli et al. (2015), who performed a qualitative analysis of Stack Overflow posts and compared the results of SentiStrength with those obtained by manual evaluation.

Beyond the discussion of sentiment analysis tools, observations similar to ours have been made in the past for software metric calculators (Barkmann et al. 2009) and code smell detection tools (Fontana et al. 2011). Similarly to our findings, disagreement between the tools was observed.

6.4 Replications and Negative Results

This paper builds on our previous work (Jongeling et al. 2015). The current submission extends it by reporting on the replication of two recent studies (Section 5). There is an enduring concern about the lack of replication studies in empirical software engineering: “Replication is not supported, industrial cases are rare ... In order to help the discipline mature, we think that more systematic empirical evaluation is needed” (Tonella et al. 2007). The challenges around replication studies in empirical software engineering have been identified by Mende (2010). de Magalhães et al. analyzed 36 papers reporting empirical and non-empirical studies related to replications in software engineering and concluded that not only do we need to replicate more studies in software engineering, but also an expansion of “specific conceptual underpinnings, definitions, and process considering the particularities” is needed (de Magalhães et al. 2014). Recent studies have begun to address this replication gap (Sfetsos et al. 2012; Greiler et al. 2015).

One of the most important benefits of replication studies centers on the possibility of arriving at negative results. Although negative results have been widely reported and regarded in different fields of computing for many years (Pritchard 1984; Fuhr and Muller 1987), their importance has been reiterated in recent years (Giraud-Carrier and Dunham 2011). By carefully and objectively examining what went wrong in the quest for an expected outcome, the state of the art and practice can be enhanced (Lindsey 2011; Täht 2014). We believe the results reported in this paper can aid such enhancement.

7 Conclusions

In this paper we have studied the impact of the choice of a sentiment analysis tool when conducting software engineering studies. We have observed that the tools considered not only disagree with the manual labeling, but also with each other, that this disagreement can lead to diverging conclusions, and that previously published results cannot be replicated when different sentiment analysis tools are used.

Our results suggest a need for sentiment analysis tools specifically targeting the software engineering domain. Moreover, going beyond the specifics of the sentiment analysis domain, we would like to encourage researchers to reuse ideas rather than tools.