1 Introduction

Sentiment analysis is “the task of identifying positive and negative opinions, emotions, and evaluations” (Wilson et al. 2005). Since its inception, sentiment analysis has been the subject of intensive research and has been successfully applied, e.g., to assist users in their development by providing them with interesting and supportive content (Honkela et al. 2012), to predict the outcome of an election (Tumasjan et al. 2010) or to predict movie sales (Mishne and Glance 2006). The spectrum of sentiment analysis techniques ranges from identifying polarity (positive or negative) to a complex computational treatment of subjectivity, opinion and sentiment (Pang and Lee 2007). In particular, the research on sentiment polarity analysis has resulted in a number of mature and publicly available tools such as SentiStrength (Thelwall et al. 2010), Alchemy,Footnote 1 the Stanford NLP sentiment analyser (Socher et al. 2013) and NLTK (Bird et al. 2009).

In recent times, large-scale software development has become increasingly social. With the proliferation of collaborative development environments, discussions between developers are recorded and archived to an extent that could not be conceived before. The availability of such discussion materials makes it easy to study whether and how the sentiments expressed by software developers influence the outcome of development activities. With this background, we apply sentiment polarity analysis to several software development ecosystems in this study.

Sentiment polarity analysis has recently been applied in the software engineering context to study commit comments in GitHub (Guzman et al. 2014), GitHub discussions related to security (Pletea et al. 2014), productivity in Jira issue resolution (Ortu et al. 2015), activity of contributors in Gentoo (Garcia et al. 2013), classification of user reviews for maintenance and evolution (Panichella et al. 2015) and evolution of developers’ sentiments in the openSUSE Factory (Rousinopoulos et al. 2014). It has also been suggested as a means of assessing technical candidates on the social web (Capiluppi et al. 2013). Not surprisingly, all the aforementioned software engineering studies, with the notable exception of the work by Panichella et al. (2015), reuse existing sentiment polarity tools, e.g., (Pletea et al. 2014) and (Rousinopoulos et al. 2014) use NLTK, while (Garcia et al. 2013; Guzman and Bruegge 2013; Guzman et al. 2014; Novielli et al. 2015) and (Ortu et al. 2015) opted for SentiStrength. While the reuse of existing tools facilitated the application of sentiment polarity analysis techniques in the software engineering domain, it also introduced a commonly recognized threat to the validity of the results obtained: those tools have been trained on non-software engineering related texts such as movie reviews or product reviews and might misidentify (or fail to identify) the polarity of a sentiment in a software engineering artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).

Therefore, in this paper we focus on sentiment polarity analysis (Wilson et al. 2005) and investigate to what extent software engineering results obtained from sentiment analysis depend on the choice of the sentiment analysis tool. We recognize that there are multiple ways to measure outcomes in software engineering. Among them, the time to resolve a particular defect and/or to respond to a particular query is relevant for end users. Accordingly, in the different data-sets studied in this paper, we have taken such resolution or response times to reflect the outcomes of our interest.

For the sake of simplicity, from here on, instead of “existing sentiment polarity analysis tools” we talk about the “sentiment analysis tools”. Specifically, we aim at answering the following questions:

  • RQ1: To what extent do different sentiment analysis tools agree with emotions of software developers?

  • RQ2: To what extent do results from different sentiment analysis tools agree with each other?

We have observed disagreement between sentiment analysis tools and the emotions of software developers but also between different sentiment analysis tools themselves. However, disagreement between the tools does not a priori mean that sentiment analysis tools might lead to contradictory results in software engineering studies making use of these tools. Thus, we ask

  • RQ3: Do different sentiment analysis tools lead to contradictory results in a software engineering study?

We have observed that disagreement between the tools might lead to contradictory results in software engineering studies. Therefore, we finally conduct replication studies in order to understand:

  • RQ4: How does the choice of a sentiment analysis tool affect validity of the previously published results?

The remainder of this paper is organized as follows. The next section outlines the sentiment analysis tools we have considered in this study. In Section 3 we study agreement between the tools and the results of manual labeling, and between the tools themselves, i.e., RQ1 and RQ2. In Section 4 we conduct a series of studies based on the results of different sentiment analysis tools. We observe that conclusions one might derive using different tools diverge, casting doubt on their validity (RQ3). While our answer to RQ3 indicates that the choice of a sentiment analysis tool might affect the validity of software engineering results, in Section 5 we perform replications of two published studies, answering RQ4 and establishing that conclusions of previously published works cannot be reproduced when a different sentiment analysis tool is used. Finally, in Section 6 we discuss related work and conclude in Section 7.

Source code and data used to obtain the results of this paper have been made available.Footnote 2

2 Sentiment Analysis Tools

2.1 Tool Selection

To perform the tool evaluation we have decided to focus on open-source tools. This requirement excludes such commercial tools as Lymbix,Footnote 3 the Sentiment API of MeaningCloudFootnote 4 or GetSentiment.Footnote 5 Furthermore, we exclude tools that require training before they can be applied, such as LibShortText (Yu et al. 2013) or the sentiment analysis libraries of popular machine learning tools such as RapidMiner or Weka. Finally, since the software engineering texts that have been analyzed in the past can be quite short (JIRA issues, Stack Overflow questions), we have chosen tools that have already been applied either to software engineering texts (SentiStrength and NLTK) or to short texts such as tweets (Alchemy or the Stanford NLP sentiment analyser).

2.2 Description of Tools

2.2.1 SentiStrength

SentiStrength is the sentiment analysis tool most frequently used in software engineering studies (Garcia et al. 2013; Guzman et al. 2014; Novielli et al. 2015; Ortu et al. 2015). Moreover, SentiStrength had the highest average accuracy among fifteen Twitter sentiment analysis tools (Abbasi et al. 2014). SentiStrength assigns an integer value between 1 and 5 for the positivity of a text, denoted p, and similarly, a value between −1 and −5 for the negativity, denoted n.

Interpretation

In order to map the separate positivity and negativity scores to a sentiment (positive, neutral or negative) for an entire text fragment, we follow the approach by Thelwall et al. (Thelwall et al. 2012). A text is considered positive when p + n > 0, negative when p + n < 0, and neutral if p = −n and p < 4. Texts with a score of p = −n and p ≥ 4 are considered to have an undetermined sentiment and are removed from the datasets.
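As an illustration, this interpretation rule can be written as a small helper (a minimal sketch; the function name is ours and is not part of SentiStrength or of our replication package):

```python
def interpret_sentistrength(p, n):
    """Map SentiStrength positivity p (1..5) and negativity n (-5..-1) to a label."""
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    # p == -n: weak opposing scores are neutral, strong opposing scores are undetermined
    return "neutral" if p < 4 else "undetermined"  # undetermined texts are removed
```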

2.2.2 Alchemy

Alchemy provides several text processing APIs, including a sentiment analysis API which promises to work on very short texts (e.g., tweets) as well as relatively long texts (e.g., news articles).Footnote 6 The sentiment analysis API returns for a text fragment a status, a language, a score and a type. The score is in the range [−1,1], the type is the sentiment of the text and is based on the score. For negative scores, the type is negative, conversely for positive scores, the type is positive. For a score of 0, the type is neutral. The status reflects the analysis success and it is either “OK” or “ERROR”.

Interpretation

We ignore texts with status “ERROR” or a non-English language. For the remaining texts we consider them as being negative, neutral or positive as indicated by the returned type.

2.2.3 NLTK

NLTK has been applied in earlier software engineering studies (Pletea et al. 2014; Rousinopoulos et al. 2014). NLTK uses a simple bag of words model and returns for each text three probabilities: a probability of the text being negative, one of it being neutral and one of it being positive. To call NLTK, we use the API provided at text-processing.com.Footnote 7

Interpretation

If the probability score for neutral is greater than 0.5, the text is considered neutral. Otherwise, it is considered to be the other sentiment with the highest probability (Pletea et al. 2014).
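A minimal sketch of this interpretation, assuming the three probabilities returned by the text-processing.com API are available as a dictionary (the helper name and the tie-breaking towards positive are our own choices):

```python
def interpret_nltk(probs):
    """probs: {'neg': float, 'neutral': float, 'pos': float}, summing to roughly 1."""
    if probs["neutral"] > 0.5:
        return "neutral"
    # otherwise pick the non-neutral sentiment with the highest probability
    return "positive" if probs["pos"] >= probs["neg"] else "negative"
```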

2.2.4 Stanford NLP

The Stanford NLP parses the text into sentences and performs a more advanced grammatical analysis, as opposed to the simpler bag of words model used, e.g., in NLTK. Indeed, Socher et al. argue that such an analysis should outperform the bag of words model on short texts (Socher et al. 2013). The Stanford NLP breaks down the text into sentences and assigns each sentence a sentiment score in the range [0,4], where 0 is very negative, 2 is neutral and 4 is very positive. We note that the tool may have difficulty breaking the text into sentences as comments sometimes include pieces of code or, e.g., URLs. The tool does not provide a document-level score.

Interpretation

To determine a document-level sentiment we compute −2·#0 − #1 + #3 + 2·#4, where #i denotes the number of sentences with score i. If this score is negative, zero or positive, we consider the text to be negative, neutral or positive, respectively.
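A minimal sketch of this aggregation (the helper name is ours):

```python
def interpret_stanford(sentence_scores):
    """sentence_scores: list of per-sentence scores in 0..4 returned by Stanford NLP."""
    weights = {0: -2, 1: -1, 2: 0, 3: 1, 4: 2}
    total = sum(weights[s] for s in sentence_scores)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```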

3 Agreement Between Sentiment Analysis Tools

In this section we address RQ1 and RQ2, i.e., to what extent the different sentiment analysis tools described earlier agree with the emotions of software developers and to what extent different sentiment analysis tools agree with each other. To perform the evaluation we use the manually labeled emotions dataset (Murgia et al. 2014).

3.1 Methodology

3.1.1 Manually-Labeled Software Engineering Data

As the “golden set” we use the data from a developer emotions study by Murgia et al. (2014). In this study, four evaluators manually labeled 392 comments with the emotions “joy”, “love”, “surprise”, “anger”, “sadness” or “fear”. Emotions “joy” and “love” are taken as indicators of positive sentiment, and “anger”, “sadness” and “fear” as indicators of negative sentiment. We exclude information about the “surprise” emotion, since surprises can be, in general, both positive and negative depending on the expectations of the speaker.

We focus on consistently labeled comments. We consider a comment as positive if at least three evaluators have indicated a positive sentiment and no evaluator has indicated a negative sentiment. Similarly, we consider a comment as negative if at least three evaluators have indicated a negative sentiment and no evaluator has indicated a positive sentiment. Finally, a comment is considered neutral when three or more evaluators have indicated neither a positive nor a negative sentiment.
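These rules can be sketched as follows (our own illustrative code, not the scripts of Murgia et al.; each comment is represented by the list of emotion sets assigned by the four evaluators, and “surprise” is ignored as explained above):

```python
POSITIVE = {"joy", "love"}
NEGATIVE = {"anger", "sadness", "fear"}

def consistent_label(evaluations):
    """evaluations: per-evaluator sets of emotions, e.g. [{"joy"}, {"love"}, set(), {"joy"}]."""
    pos = sum(1 for e in evaluations if e & POSITIVE)
    neg = sum(1 for e in evaluations if e & NEGATIVE)
    if pos >= 3 and neg == 0:
        return "positive"
    if neg >= 3 and pos == 0:
        return "negative"
    if sum(1 for e in evaluations if not (e & (POSITIVE | NEGATIVE))) >= 3:
        return "neutral"
    return None  # contradictory labeling, excluded from the golden set
```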

Using these rules we can conclude that 265 comments have been labeled consistently: 19 negative, 41 positive and 205 neutral. The remaining 392 − 265 = 127 comments from the study of Murgia et al. (2014) have received contradictory labels, e.g., “fear” from one evaluator and “joy” from another.

3.1.2 Evaluation Metrics

Since more than 77 % of the comments have been manually labeled as neutral, i.e., the dataset is unbalanced, traditional metrics such as accuracy might be misleading (Batista et al. 2000): indeed, the accuracy of a straw-man sentiment analysis predicting “neutral” for every comment can easily be higher than that of any of the four tools. Therefore, rather than reporting the accuracy of the approaches we use the weighted kappa (Cohen 1968) and the Adjusted Rand Index (ARI) (Hubert and Arabie 1985; Santos and Embrechts 2009). For the sake of completeness we report the F-measures for the three categories of sentiments.

Kappa is a measure of interrater agreement. As recommended by Bakeman and Gottman (Bakeman and Gottman 1997, p. 66) we opt for the weighted kappa (κ) since the sentiments can be seen as ordered, from positive through neutral to negative, and disagreement between positive and negative is more “severe” than between positive and neutral or negative and neutral. Our weighting scheme, also following the guidelines of Bakeman and Gottman, is shown in Table 1. We follow the interpretation of κ as advocated by Viera and Garrett (Viera and Garrett 2005) since it is more fine-grained than, e.g., the one suggested by Fleiss et al. (2003, p. 609). We say that the agreement is less than chance if κ ≤ 0, slight if 0.01 ≤ κ ≤ 0.20, fair if 0.21 ≤ κ ≤ 0.40, moderate if 0.41 ≤ κ ≤ 0.60, substantial if 0.61 ≤ κ ≤ 0.80 and almost perfect if 0.81 ≤ κ ≤ 1. To answer the first research question we look for the agreement between the tool and the manual labeling; to answer the second one—for agreement between two tools.

Table 1 Weighting scheme for the weighted kappa computation
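For illustration, a minimal sketch of the weighted kappa computation, assuming a linear-style weighting in the spirit of Table 1 (0 for agreement, 1 for a disagreement between neutral and positive/negative, 2 for a disagreement between positive and negative); the exact weights of Table 1 are not reproduced here:

```python
import numpy as np

CATEGORIES = ["negative", "neutral", "positive"]
WEIGHTS = np.array([[0, 1, 2],
                    [1, 0, 1],
                    [2, 1, 0]], dtype=float)  # assumed disagreement weights

def weighted_kappa(labels_a, labels_b):
    k = len(CATEGORIES)
    observed = np.zeros((k, k))
    for a, b in zip(labels_a, labels_b):
        observed[CATEGORIES.index(a), CATEGORIES.index(b)] += 1
    observed /= observed.sum()                                       # observed proportions
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))  # chance proportions
    return 1 - (WEIGHTS * observed).sum() / (WEIGHTS * expected).sum()
```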

ARI measures the correspondence between two partitions of the same data. Similarly to the Rand index (Rand 1971), ARI evaluates whether pairs of observations (comments) are considered as belonging to the same category (sentiment) rather than whether observations (comments) have been assigned to correct classes (sentiments). As opposed to the Rand index, ARI corrects for the possibility that pairs of observations have been put in the same category by chance. The expected value of ARI for independent partitions is 0. The maximal value, obtained, e.g., for identical partitions, is 1; the closer the value of ARI is to 1, the better the correspondence between the partitions. To answer the first research question we look for the correspondence between the partition of the comments into positive, neutral and negative groups provided by the tool and the partition based on the manual labeling. Similarly, to answer the second research question we look for the correspondence between the partitions of the comments into positive, neutral and negative groups provided by different tools.

Finally, F-measure, introduced by Lewis and Gale (1994) based on the earlier E-measure of Van Rijsbergen (1979, p. 128), is the harmonic mean of the precision and recall. Recall that precision in the classification context is the ratio of true positivesFootnote 8 and all entities predicted to be positive, while recall is the ratio of true positives and all entities known to be positive. The symmetry between precision and recall, false positives and false negatives, inherent in the F-measure makes it applicable both when addressing RQ1 and when addressing RQ2. We report the F-measure separately for the three classes: neutral, positive and negative.
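Both ARI and the per-class F-measures are readily available in scikit-learn; a minimal sketch (the labels shown are illustrative and not taken from the dataset):

```python
from sklearn.metrics import adjusted_rand_score, f1_score

manual = ["neutral", "positive", "negative", "neutral"]   # manual labeling
tool   = ["neutral", "neutral",  "negative", "positive"]  # output of a tool

ari = adjusted_rand_score(manual, tool)
f_per_class = f1_score(manual, tool,
                       labels=["negative", "neutral", "positive"], average=None)
```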

3.2 Results

None of the 265 consistently labeled comments produce SentiStrength results with p = −n and p ≥ 4. Three comments produce the “ERROR” status with Alchemy; we exclude those comments from consideration and report κ and ARI for the remaining 262 comments.

Results obtained both for RQ1 and for RQ2 are summarized in Table 2. Detailed confusion matrices relating the results of the tools and the manual labeling as well as results of different tools to each other are presented in Appendix A.

Table 2 Agreement of sentiment analysis tools with the manual labeling and with each other

3.3 Discussion

Our results clearly indicate that the sentiment analysis tools do not agree with the manual labeling and neither do they agree with each other.

RQ1

As can be observed from Table 2, both κ and ARI show that the tools are quite far from agreeing with the manual labeling: κ is merely fair, and ARI is low. NLTK scores best, followed by SentiStrength, and both perform better than Alchemy and Stanford NLP. Even when focusing solely on the positive and the negative sentiments, the F-measures suggest that improving the F-measure for the negative sentiments tends to decrease the F-measure for the positive ones, and vice versa.

RQ2

Values of κ and ARI obtained when comparing different tools with each other are even lower than those obtained for the agreement with the manual labeling. The highest value of κ, 0.25, has been obtained for Alchemy and Stanford NLP, and is only fair. Agreement between NLTK and SentiStrength, while also only fair, is the second highest one among the six possible pairs in Table 2.

To illustrate the reasons for the disagreement between the tools and the manual labeling as well as between the tools themselves we discuss a number of example comments.

Example 1

Our first example is a developer describing a clearly undesirable behavior (memory leak) in Apache UIMA. The leak, however, has been fixed; the developer confirms this and thanks the community.

To test this I used an aggregate AE with a CAS multiplier that declared getCasInstancesRequired()=5. If this AE is instantiated and run in a loop with earlier code it eats up roughly 10MB per iteration. No such leak with the latest code. Thanks!

Due to the presence of the expression of gratitude, the comment has been labeled as “love” by all four participants of the study by Murgia et al. We interpret this as a clear indication of positive sentiment. However, none of the tools is capable of recognizing this: SentiStrength labels the comment as being neutral, NLTK, Alchemy and Stanford NLP—as being negative. Indeed, for instance, Stanford NLP believes the first three sentences to be negative (e.g., due to the presence of “No”), and while it correctly recognizes the last sentence as positive, this is not enough to change the evaluation of the comment as a whole.

Example 2

The following comment from Apache Xerces merely describes an action that has taken place (“committed a patch”).

D.E. VeloperFootnote 9 committed your patch for Xerces 2.6.0. Please verify.

Three out of four annotators do not recognize the presence of emotion in this comment and we interpret this as the comment being neutral. However, keyword-based sentiment analysis tools might wrongly identify the presence of sentiment. For instance, in SentiWordNet (Baccianella et al. 2010) the verb “commit”, in addition to neutral meanings (e.g., perpetrate an act as in “commit a crime”), has several positive meanings (e.g., confer a trust upon, “I commit my soul to God”, or cause to be admitted when speaking of a person to an institution, “he was committed to prison”). In a similar way, the word “patch”, in addition to neutral meanings, has negative meanings (e.g., sewing that repairs a worn or torn hole, or a piece of soft material that covers and protects an injured part of the body). Hence, it should come as no surprise that some sentiment analysis tools identify this comment as positive, some others as negative and, finally, some as neutral.

These examples show that in order to be successfully applied in the software engineering context, sentiment analysis tools should become aware of the peculiarities of the software engineering domain: e.g., that the words “commit” and “patch” are merely technical terms and do not express sentiment. Our observation concurs with the challenge Novielli et al. (2015) have recognized in sentiment detection in social programming ecosystems such as Stack Overflow.

3.4 A Follow-up Study

Given the disagreement between different sentiment analysis tools, we wonder whether focusing only on the comments where the tools agree with each other would result in a better agreement with the manual labeling. Clearly, since the tools tend to disagree, such a focus reduces the number of comments that can be evaluated. However, it is a priori not clear whether a better agreement with the manual labeling can be expected. Thus, we have conducted a follow-up study: for every group of tools we consider only comments on which the tools agree, and determine κ, ARI and the F-measures with respect to the manual labeling.
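A minimal sketch of this filtering step (our own code, not part of the replication package):

```python
def agreeing_subset(tool_labels, manual_labels):
    """tool_labels: list of per-tool label lists; manual_labels: the golden-set labels."""
    kept_pred, kept_manual = [], []
    for i, manual in enumerate(manual_labels):
        predictions = {labels[i] for labels in tool_labels}
        if len(predictions) == 1:            # all tools in the group agree on this comment
            kept_pred.append(predictions.pop())
            kept_manual.append(manual)
    # the two lists can then be fed into the kappa, ARI and F-measure computations
    return kept_pred, kept_manual
```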

Results of the follow-up study are summarized in Table 3. As expected, the more tools we consider, the fewer comments remain. Recalling that 262 comments have been considered in our previous evaluation, only 52.6 % remain if agreement between two tools is required. For four tools slightly more than 20 % of the comments remain. We also see that focusing on the comments where the tools agree improves the agreement with the manual labeling both in terms of κ and in terms of ARI. The F-measures follow, in general, the same trend. This means that a trade-off should be sought between the number of comments the tools agree upon and the agreement with the manual labeling.

Table 3 Agreement of groups of tools with the manual labeling (n—the number of comments the tools agree upon)

3.5 Threats to Validity

As any empirical evaluation, the study presented in this section is subject to threats to validity:

  • Construct validity might have been threatened by our operationalization of sentiment polarity via emotion, recorded in the dataset by Murgia et al. (2014) (cf. the observations of Novielli et al. (2015)).

  • Internal validity of our evaluation might have been affected by the exact ways the tools have been applied and by the interpretation of the tools’ output as an indication of sentiment, e.g., the calculation of a document-level sentiment as −2·#0 − #1 + #3 + 2·#4 for Stanford NLP. Another threat to internal validity stems from the choice of the evaluation metrics: to reduce this threat we report several agreement metrics (ARI, weighted κ and F-measures) recommended in the literature.

  • External validity of this study can be threatened by the fact that only one dataset has been considered and by the way this dataset has been constructed and evaluated by Murgia et al. (2014). To encourage replication of our study and evaluation of its external validity we make publicly available both the source code and the data used to obtain the results of this paper.Footnote 10

3.6 Summary

We have observed that the sentiment analysis tools do not agree with the manual labeling (RQ1) and neither do they agree with each other (RQ2).

4 Impact of the Choice of Sentiment Analysis Tool

In Section 3 we have seen not only that the agreement of the sentiment analysis tools with the manual labeling is limited, but also that different tools do not necessarily agree with each other. However, this disagreement does not necessarily mean that conclusions based on the application of these tools in the software engineering domain are affected by the choice of the tool. Therefore, we now address RQ3 and discuss a simple set-up of a study aiming at understanding differences in response times for positive, neutral and negative texts.

4.1 Methodology

In the context of addressing RQ3, we study whether differences can be observed between response times (issue resolution times or question answering times) for positive, neutral and negative texts. We do not claim that the type of comment (positive, neutral or negative) is the main factor influencing response time: indeed, certain topics might be more popular than others and questions asked during the weekend might lead to higher resolution times. However, if different conclusions are derived for the same dataset when different sentiment analysis tools are used, then we can conclude that the disagreement between sentiment analysis tools affects the validity of conclusions in the software engineering domain.

Recent studies considering sentiment in software engineering data tend to include additional variables, e.g., sentiment analysis has been recently combined with politeness analysis (Danescu-Niculescu-Mizil et al. 2013) to study issue resolution time (Destefanis et al. 2016; Ortu et al. 2015). To illustrate the impact of the choice of sentiment analysis tool on the study outcome in presence of other analysis techniques, we repeat the response time study but combine sentiment analysis with politeness analysis.

4.1.1 Sentiment Analysis Tools

Based on the answers to RQ1 and RQ2 presented in Section 3.3 we select SentiStrength and NLTK to address RQ3. Indeed, NLTK scores best when compared to the manual labeling, followed by SentiStrength, and both perform better than Alchemy and Stanford NLP. Agreement between NLTK and SentiStrength, while also only fair, is still the second highest one among the six possible pairs in Table 2.

Moreover, we also repeat each study on the subset of texts where NLTK and SentiStrength agree. Indeed, Table 3 shows that these tools agree upon the largest subset of comments, achieving at the same time the highest κ, ARI and F-measures for the neutral and negative classes among the two-tool combinations. We also observe that further improvement of the evaluation metrics is possible, but at the cost of a significant drop in the number of comments.

4.1.2 Datasets

We study seven different datasets: titles of issues of the Android issue tracker, descriptions of issues of the Android issue tracker, titles of issues of the Apache Software Foundation (ASF) issue tracker, descriptions of issues of the ASF issue tracker, descriptions of issues of the Gnome issue tracker, titles of the Gnome-related Stack Overflow questions and bodies of the Gnome-related Stack Overflow questions. As opposed to the Android dataset, Gnome issues do not have titles. To ensure the validity of our study we have opted for five datasets collected independently by other researchers (Android Issue Tracker descriptions and titles, Gnome Issue Tracker descriptions, ASF Issue Tracker descriptions and titles) and two datasets derived by us from a well-known public data source (Gnome-related Stack Overflow question titles and bodies). All datasets are publicly available for replication purposes.Footnote 11 The descriptive statistics of the resolution/response times from these data-sets are given in Table 4.

Table 4 Descriptive statistics of resolution/response times

Android Issue Tracker

A dataset of 20,169 issues from the Android issue tracker was part of the mining challenge of MSR 2012 (Shihab et al. 2012). Excluding issues without a closing date, as well as those with bug_status “duplicate”, “spam” or “usererror”, results in a dataset with 5,216 issues.

We analyze the sentiment of the issue titles and descriptions. Five issues have an undetermined description sentiment. We remove these issues from further analysis on the titles and the descriptions. To measure the response time, we calculate the time difference in seconds between the opening (openedDate) and closing time (closedOn) of an issue.
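For illustration, a minimal sketch of this computation, assuming the issues are available as a CSV file with the openedDate and closedOn columns (the file name is hypothetical):

```python
import pandas as pd

issues = pd.read_csv("android_issues.csv", parse_dates=["openedDate", "closedOn"])
issues["response_time_s"] = (issues["closedOn"] - issues["openedDate"]).dt.total_seconds()
```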

Gnome Issue Tracker

The Gnome project issue tracker dataset containing 431,863 issues was part of the 2009 MSR mining challenge.Footnote 12 Similarly to the Android dataset, we have looked only at issues with the value of the bug_status field being resolved. In total, 367,877 issues have been resolved. We analyze the sentiment of the short descriptions of the issues (short_desc) and calculate the time difference in seconds between the creation and closure of each issue. Recall that, as opposed to the Android dataset, Gnome issues do not have titles.

Gnome-Related Stack Overflow Discussions

We use the StackExchange online data explorerFootnote 13 to obtain all Stack Overflow posts created before May 20, 2015, tagged gnome and having an accepted answer. For all 410 collected posts, we calculate the time difference in seconds between the creation of the post and the creation of the accepted answer. Before applying a sentiment analysis tool we remove HTML formatting from the titles and bodies of posts. In the results, we refer to the body of a post as its description.

ASF Issue Tracker

We use a dataset containing data from the ASF issue tracking system Jira. This dataset was collected by Ortu et al. (2015) and contains 701,002 issue reports. We analyze the sentiments of the titles and the descriptions of 95,667 issue reports that have a non-null resolved date, a resolved status and the resolution value being Fixed.

4.1.3 Politeness Analysis

Similarly to sentiment analysis classifying texts into positive, neutral and negative, politeness analysis classifies texts into polite, neutral and impolite. In our work we use the Stanford politeness APIFootnote 14 based on the work of Danescu-Niculescu-Mizil et al. (2013). As opposed to sentiment analysis tools such as SentiStrength and NLTK, the Stanford politeness API has been evaluated on software engineering data: Stack Overflow questions and answers.

Given a textual fragment the Stanford politeness API returns a politeness score ranging between 0 (impolite) and 1 (polite) with 0.5 representing the “ideal neutrality”. To discretize the score into polite, neutral and impolite we apply the Stanford politeness API to the seven datasets above. It turns out that the politeness scores of the majority of comments are low: the median score is 0.314, the mean score is 0.361 and the third quartile (Q3) is 0.389. We use the latter value to determine the neutrality range. We say therefore that the comments scoring between 0.389 and 0.611 = 1 − 0.389 are neutral; comments scoring lower than 0.389 are impolite and comments scoring higher than 0.611 are polite.
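A minimal sketch of this discretization (the helper name is ours):

```python
Q3 = 0.389  # third quartile of the politeness scores across the seven datasets

def discretize_politeness(score):
    """Map a Stanford politeness API score in [0, 1] to a politeness class."""
    if score < Q3:
        return "impolite"
    if score > 1 - Q3:  # 0.611
        return "polite"
    return "neutral"
```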

4.1.4 Statistical Analysis

To answer our research questions we need to compare distributions of response times corresponding to different groups of issues. We conduct two series of studies. In the first series of studies we compare the distributions of the response times corresponding to positive, neutral and negative questions/issues. In the second series we also consider politeness and compare the distributions of the response times corresponding to nine groups obtained through all possible combinations of sentiment (positive, neutral and negative) and politeness (polite, neutral and impolite).

Traditionally, a comparison of multiple groups follows a two-step approach: first, a global null hypothesis is tested, then multiple comparisons are used to test sub-hypotheses pertaining to each pair of groups. The first step is commonly carried out by means of ANOVA or its non-parametric counterpart, the Kruskal-Wallis one-way analysis of variance by ranks. The second step uses the t-test or the rank-based Wilcoxon-Mann-Whitney test (Wilcoxon 1945), with correction for multiple comparisons, e.g., Bonferroni correction (Dunn 1961; Sheskin 2007). Unfortunately, the global test null hypothesis may be rejected while none of the sub-hypotheses are rejected, or vice versa (Gabriel 1969). Moreover, simulation studies suggest that the Wilcoxon-Mann-Whitney test is not robust to unequal population variances, especially in the case of unequal sample sizes (Brunner and Munzel 2000; Zimmerman and Zumbo 1992). Therefore, one-step approaches are preferred: these should produce confidence intervals which always lead to the same test decisions as the multiple comparisons. We use the \(\widetilde {\mathbf {T}}\)-procedure (Konietschke et al. 2012) for Tukey-type contrasts (Tukey 1951), the probit transformation and the traditional 5 % family error rate (cf. Vasilescu et al. 2013; Wang et al. 2014).

The results of the \(\widetilde {\mathbf {T}}\)-procedure are a series of probability estimates p(a, b) with the corresponding p-values, where a and b represent the distributions being compared. The probability estimate p(a, b) is interpreted as follows: if the corresponding p-value exceeds 5 % then no evidence has been found for a difference in response times corresponding to categories a and b. If, however, the corresponding p-value does not exceed 5 % and p(a, b) > 0.5, then response times in category b tend to be larger than those in category a. Finally, if the corresponding p-value does not exceed 5 % and p(a, b) < 0.5, then response times in category a tend to be larger than those in category b.

We opt for comparison of distributions rather than a more elaborate statistical modeling (cf. Ortu et al. 2015) since it allows for an easy comparison of the results obtained for different tools.
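For illustration only, a minimal sketch of the traditional two-step approach described above (Kruskal-Wallis followed by pairwise Wilcoxon-Mann-Whitney tests with a Bonferroni correction); the one-step \(\widetilde {\mathbf {T}}\)-procedure that we actually use is not sketched here:

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def two_step_comparison(groups):
    """groups: dict mapping a category name to a list of response times."""
    _, global_p = kruskal(*groups.values())           # step 1: global null hypothesis
    pairs = list(combinations(groups, 2))
    corrected = {}
    for a, b in pairs:                                # step 2: pairwise comparisons
        _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        corrected[(a, b)] = min(p * len(pairs), 1.0)  # Bonferroni-corrected p-value
    return global_p, corrected
```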

4.1.5 Agreement Between the Results

Recall that sentiment analysis tools induce a partition of the response times into categories. For every pair of categories (a, b) the \(\widetilde {\mathbf {T}}\)-procedure indicates one of the three following outcomes: > (response times in category a tend to be larger than those in category b), < (response times in category b tend to be larger than those in category a) or ∥ (no evidence has been found for a difference in response times corresponding to categories a and b). We stress that we refrain from interpreting lack of evidence for difference as evidence for lack of difference, i.e., we do not claim that the distributions of response times corresponding to categories a and b are the same but merely that we cannot find evidence that these distributions are not the same. Hence, we also use ∥ (incomparable) rather than = (equal).

To compare the tools we therefore need to assess the agreement between the results produced by the \(\widetilde {\mathbf {T}}\)-procedure for partitions induced by different tools.

Example 3

Let the \(\widetilde {\mathbf {T}}\)-procedure report “pos < neu”, “pos < neg” and “neu < neg” for partitions induced by Tool1, “pos < neu”, “pos < neg” and “neu ∥ neg” for partitions induced by Tool2, and “pos > neu”, “pos > neg” and “neu ∥ neg” for partitions induced by Tool3. Then, we would like to say that Tool1 agrees more with Tool2 than with Tool3, and Tool2 agrees more with Tool3 than with Tool1.

Unfortunately, traditional agreement measures such as those discussed in Section 3.1.2 are no longer applicable since the number of datapoints (pairs of categories) is small: 3 for sentiment and 36 for the sentiment-politeness combination. Hence, we propose to count the pairs of categories (a, b) such that the \(\widetilde {\mathbf {T}}\)-procedure produces the same result for partitions induced by both tools (so-called observed agreement).

Example 4

For Example 3 we observe that Tool1 and Tool2 agree on two pairs, Tool1 and Tool3 agree on zero pairs, and Tool2 and Tool3 agree on one pair.

We believe, however, that a disagreement between claims “response times in category a tends to be larger than those in category b” and “response times in category b tends to be larger than those in category a” is more severe than between claims “response times in category a tends to be larger than those in category b” and “no evidence has been found for difference in response times corresponding to categories a and b”. One possible way to address this concern would be to associate different kinds of disagreement with different weights: this is an approach taken, e.g., by the weighted κ (Cohen 1968). However, the choice of specific weights might appear arbitrary.

Hence, when reporting disagreement between the tools (cf. Tables 6 and 8 below) we report different kinds of disagreement separately, i.e., we report four numbers x − y − z − w, where

  • x is the number of pairs for which the tools agree about the relation between the response times (>> or <<),

  • y is the number of pairs for which the tools agree about the lack of such a relation (∥∥),

  • z is the number of pairs when one of the tools has established the relation and another one did not (∥>, ∥<, <∥ or >∥),

  • w is the number of pairs when the tools have established different relations (<> or ><).

Example 5

Example 3, continued. We report agreement between Tool1 and Tool2 as 2 − 0 − 1 − 0, between Tool1 and Tool3 as 0 − 0 − 1 − 2, and between Tool2 and Tool3 as 0 − 1 − 0 − 2.
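A minimal sketch of this counting (our own code), reproducing the comparison of Tool1 and Tool2 from Example 3:

```python
def count_agreement(results_a, results_b):
    """results: dict mapping a category pair to '>', '<' or '||'."""
    x = y = z = w = 0
    for pair in results_a:
        ra, rb = results_a[pair], results_b[pair]
        if ra == rb:
            if ra == "||":
                y += 1   # both tools found no evidence of a difference
            else:
                x += 1   # both tools found the same relation
        elif "||" in (ra, rb):
            z += 1       # only one of the tools found a relation
        else:
            w += 1       # the tools found opposite relations
    return x, y, z, w

tool1 = {("pos", "neu"): "<", ("pos", "neg"): "<", ("neu", "neg"): "<"}
tool2 = {("pos", "neu"): "<", ("pos", "neg"): "<", ("neu", "neg"): "||"}
print(count_agreement(tool1, tool2))  # (2, 0, 1, 0), as in Example 5
```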

4.2 Results

Results of our study are summarized in Table 5. For the sake of readability the relations found are aligned horizontally. For each dataset and each tool we also report the number of issues/questions recognized as negative, neutral or positive.

Table 5 Comparison of NLTK and SentiStrength. Thresholds for statistical significance: 0.05 (∗), 0.01 (∗∗), 0.001 (∗∗∗). Exact p-values are indicated as subscripts; 0 indicates that the p-value is too small to be computed precisely. For the sake of readability we omit pairs for which no evidence has been found for differences in response times

We observe that NLTK and SentiStrength agree only on one relation for the Android data, i.e., that issues with a neutral sentiment tend to be resolved more slowly than issues formulated in a more positive way. We also observe that for Gnome and ASF the tools agree that the issues with a neutral sentiment are resolved faster than issues with a positive sentiment, i.e., the results for Gnome and ASF are opposite to those for Android.

Further inspection reveals that differences between NLTK and SentiStrength led to the relations “neu > neg” and “neg > pos” being discovered in Android issue descriptions by only one of the tools and not by the other. In the same way, “pos > neg” on the ASF descriptions data can be found only by SentiStrength. It is also surprising that while “pos > neg” has been found for the ASF titles data both by NLTK and by SentiStrength, it cannot be found when one restricts the attention to the issues where the tools agree. Finally, contradictory results have been obtained for Gnome issue descriptions: while the NLTK-based analysis suggests that the positive issues are resolved more slowly than the negative ones, the SentiStrength-based analysis suggests the opposite.

Overall, the agreement between NLTK, SentiStrength and NLTKSentiStrength reported as described in Section 4.1.5 is summarized in Table 6.

Table 6 Agreement between NLTK, SentiStrength and NLTK \(\cap \) SentiStrength. See Section 4.1.5 for the explanation of the x − y − z − w notation

Next we perform a similar study by including the politeness information. Table 7 summarizes the findings for Android. Observe that not a single relation could have been established both by NLTK and by SentiStrength. Results for Gnome, Stack Overflow and ASF are presented in Tables 18, 19 and 20 in the appendix. Agreement is summarized in Table 8: including politeness increases the number of categories to be compared to nine, and therefore, the number of possible category pairs to \(\frac {9*8}{2} = 36\). Table 8 suggests that while the tools tend to agree on the relation or lack thereof between most of the category pairs, the differences between the tools account for the differences in the relations observed in up to 30 % (11/36) of the pairs. Still, while differences between the tools leading to contradictory results are relatively rare (two cases in Gnome, one in ASF titles and one in ASF descriptions), the differences tend to manifest as a relation being discovered when only one of the tools is used.

Table 7 Comparison of NLTK and SentiStrength in combination with politeness for the Android datasets. Thresholds for statistical significance: 0.05 (∗), 0.01 (∗∗), 0.001 (∗∗∗). Exact p-values are indicated as subscripts. Results for Gnome, Stack Overflow and ASF are presented in Tables 18, 19 and 20 in the appendix
Table 8 Agreement between NLTK, SentiStrength and NLTK \(\cap \) SentiStrength (politeness information included). See Section 4.1.5 for the explanation of the x − y − z − w notation

4.3 Discussion

Our results suggest the choice of the sentiment analysis tool affects the conclusions one might derive when analysing differences in the response times, casting doubt on the validity of those conclusions. We conjecture that the same might be observed for any kind of software engineering studies dependent on off-the-shelf sentiment analysis tools. A more careful sentiment analysis for software engineering texts is therefore needed: e.g., one might consider training more general purpose machine learning tools such as Weka (Hall et al. 2009) or RapidMinerFootnote 15 on software engineering data.

A similar approach has recently been taken by Panichella et al. (2015), who have used Weka to train a Naive Bayes classifier on 2090 App Store and Google Play review sentences. Indeed, both the dependency of sentiment analysis tools on the domain (Gamon et al. 2005) and the need for text-analysis tools specifically targeting texts related to software engineering (Howard et al. 2013) have been recognized in the past.

4.4 Threats to Validity

Validity of the conclusions derived might have been threatened by the choice of the data as well by the choice of the statistical machinery.

To reduce the threats related to the data, we have opted for seven different but similar datasets: the Stack Overflow dataset contains information about questions and answers; Android, Gnome and ASF—information about issues. We expect the conclusions above to be valid at least for other issue trackers and software engineering question & answer platforms. For Android, Gnome and ASF we have reused data collected by other researchers (Shihab et al. (2012), BirdFootnote 16 and Ortu et al. (2015), respectively). We believe the threats associated with noise in these datasets are limited as they have been extensively used in previous studies: e.g., Asaduzzaman et al. and Martie et al. used the Android dataset, Linstead and Baldi (2009) used the Gnome dataset, and Ortu et al. (2015) used the ASF dataset. The only dataset we have collected ourselves is the Stack Overflow dataset, and indeed the usual threats related to the completeness of the data (questions can be removed) apply. Furthermore, the presence of machine-generated text, e.g., error messages, stack traces or source code, might have affected our results.

Similarly, to reduce the threats related to the choice of the statistical machinery we opt for the \(\widetilde {\mathbf {T}}\)-approach (Konietschke et al. 2012) that has been successfully applied in the software engineering context (Dajsuren et al. 2013; Li et al. 2014; Sun et al. 2015; Vasilescu et al. 2013; Vasilescu et al. 2013; Wang et al. 2014; Yu et al. 2016).

5 Implications on Earlier Studies

In this section we consider RQ4: while the preceding discussion indicates that the choice of a sentiment analysis tool might affect the validity of software engineering results, we now investigate whether this is indeed the case by performing replication studies (Shull et al. 2008) for two published examples. Since our goal is to understand whether the effects observed in the earlier studies hold when a different sentiment analysis tool is used, we opt for dependent or similar replications (Shull et al. 2008). In dependent replications the researchers aim at keeping the experiment the same or very similar to the original one, possibly changing the artifact being studied.

5.1 Replicated Studies

We have chosen to replicate two previous studies conducted as part of the 2014 MSR mining challenge: both studies use the same dataset of 90 GitHub projects (Gousios 2013). The dataset includes information from the top-10 starred repositories in the most popular programming languages and is not representative of GitHub as a wholeFootnote 17.

The first paper we have chosen to replicate is the one by Pletea et al. (2014). In this paper the authors apply NLTK to GitHub comments and discussions, and conclude that security-related discussions on GitHub contain more negative emotions than non-security related discussions. Taking the blame, the fourth author of the current manuscript has also co-authored the work by Pletea et al. (2014).

The second paper we have chosen to replicate is the one by Guzman et al. (2014). The authors apply SentiStrength to analyze the sentiment of GitHub commit comments and conclude that comments written on Mondays tend to contain a more negative sentiment than comments written on other days. This study was the winner of the MSR 2014 challenge.

5.2 Replication Approach

We aim at performing an exact replication of the chosen studies with one notable deviation from the original work: we apply a different sentiment analysis tool to each study. Since the original study of Pletea et al. uses NLTK, we intend to apply SentiStrength in the replication; since Guzman et al. use SentiStrength, we intend to apply NLTK. However, since the exact collections of comments used in the original studies were no longer available, we had to recreate the datasets ourselves. This led to minor differences in the number of comments we have found as opposed to the numbers reported in the original studies. Hence, we replicate each study twice: first applying the same tool as in the original study to slightly different data, second applying a different sentiment analysis tool to the same data as in the first replication.

We hypothesize that the differences between applying the same tool to slightly different datasets would be small. However, we expect that we might get different, statistically significant, results in these studies when using a different sentiment analysis tool.

5.2.1 Pletea et al.

Pletea et al. distinguish between comments and discussions, i.e., collections of comments pertaining to an individual commit or pull request. Furthermore, the authors distinguish between security-related and non-security related comments/discussions, resulting in eight different categories of texts. The original study has found that for commit comments, commit discussions, pull request comments and pull request discussions, the negativity for security-related texts is higher than for other texts. Results of comparing the sentiment recognition of a sentiment analysis tool (NLTK) with 30 manually labeled security-related commit discussions were mixed. Moreover, it has been observed that the NLTK results were mostly bipolar, having both strong negative and strong positive components. Based on these observations the authors suggest that security-related discussions are more emotional than non-security related ones.

In our replication of this study we present a summary of the distribution of the sentiments for commits and pull requests, recreating Tables 2 and 3 from the original study. In order to do this, we also need to distinguish security-related texts and other texts, i.e., we replicate Table 1 from the paper. We extend the original comparison with the manually labeled discussions by including the results obtained by SentiStrength.

5.2.2 Guzman et al.

In this study, the authors have focused on commit comments and studied differences between the sentiments of commit comments written on different days of the week and at different times of day, belonging to projects in different programming languages, created by teams distributed over different continents and “starred”, i.e., approved, by different numbers of GitHub users.

We replicate the studies pertaining to differences between comments based on day and time of their creation and programming language of the project. We do not replicate the study related to the geographic distribution of the authors because the mapping of developers to continents has been manually made by Guzman et al. and was not present in the original dataset.

5.3 Replication Results

Here we present the results of replicating both studies.

5.3.1 Pletea et al.

We start the replication by creating Table 9, which corresponds to Table 1 from the paper by Pletea et al. We have rerun the division using the keyword list as included in the original paper. As explained above, we have found slightly different numbers of comments and discussions in each category. Most notably, we find 180 fewer security-related commit comments. However, the percentages of security and non-security related comments and discussions are similar.

Table 9 Identification of security-related comments and discussions results

To ensure validity of the comparison between NLTK and SentiStrength we have applied both tools to comments and discussions. On several occasions the tools reported an error. We have decided to exclude those cases to ensure that further analysis applies to exactly the same comments and discussions. Hence, in Table 9 we also report the numbers of comments and discussions excluded.

Next we apply NLTK and SentiStrength to analyze the sentiment of comments and discussions. Tables 10 and 11 present the results of Tables 2 and 3 of the original paper, respectively, and extend them by including the results of NLTK and SentiStrength on the current study dataset from Table 9. Inspecting Tables 10 and 11 we observe that the values obtained when using NLTK are close to those reported by Pletea et al., while SentiStrength produces very different results. Indeed, NLTK indicates that comments and discussions, submitted via commits or via pull requests, are predominantly negative, while according to SentiStrength neutral is the predominant classification.

Table 10 Commits sentiment analysis statistics. The largest group per study is typeset in boldface
Table 11 Pull Requests sentiment analysis statistics. The largest group per study is typeset in boldface

Despite those differences, the original conclusion of Pletea et al. still holds: whether we consider comments or discussions, commits or pull requests, the percentage of negative texts among security-related texts is higher than among non-security related texts.

Finally, in Table 4 Pletea et al. consider thirty security-related commit discussions and compare the evaluation of security relevance and sentiment as determined by the tools with the decisions of a human evaluator. The discussions have been selected based on the number of security keywords found: ten discussions labeled as “high” have been randomly selected from the top 10 % of discussions with the highest number of security keywords found, “middle” from the middle 10 % and “low” from the bottom 10 % of all security-related discussions.

Table 12 extends Table 4 (Pletea et al. 2014) by adding a column with the results of SentiStrength. Asterisks indicate the strength of the sentiment as perceived by the human evaluator.

Table 12 Case study results. Strength of the human-labeled sentiments has been labeled by Pletea et al. on a 5-star scale (Pletea et al. 2014)

By inspecting Table 12 we observe that NLTK agrees with the human evaluator in 14 cases out of 30; SentiStrength—in 13 cases out of 30, but the tools agree with each other only in 9 cases. We can therefore conclude that replacing NLTK by SentiStrength did not affect this conclusion of the original study: the results of the agreement with the manual labeling are still mixed.

We also observe that both for NLTK and for SentiStrength agreement in the “high” security group is lower than in the “low” security group.

Moreover, Pletea et al. have observed that the NLTK results were mostly bipolar, having both strong negative and strong positive components, suggesting that security-related discussions are more emotional. This observation is not supported by SentiStrength, which classifies 17 out of 30 discussions as neutral.

5.3.2 Guzman et al.

We classified all 60658 commit comments in the MSR 2014 challenge dataset (Gousios 2013) using NLTK.

In the original paper by Guzman et al. (2014) the authors claim, on the one hand, to have analyzed 60425 commit comments and, on the other hand, to have focused on comments of all projects having more than 200 comments. However, when replicating this study and considering comments of projects having more than 200 comments we have obtained merely 50133 comments, more than ten thousand comments fewer than in the original study. Therefore, to be as close as possible to the original study we have decided to include all commit comments in the dataset, which produced 233 comments more than in the original study.

Guzman et al. start by considering the six projects with the highest number of commit comments: Jquery, Rails, CraftBukkit, Diaspora, MaNGOS and TrinityCore. The authors present two charts to show the average sentiment score in those six projects and the proportions of negative, neutral and positive sentiments in commit comments. We replicate their study twice: first using the same tool used by the authors (SentiStrength), and then using an alternative tool (NLTK).

Figs. 2 and 3 show the replication of the study of the average sentiment score in the six projects. The original figure from the work of Guzman et al. is shown in Fig. 1. Comparing Fig. 1 with Fig. 2 we observe that while the exact values of the averages are lower in the replication, the relative order of the projects is almost the same. Indeed, Rails is the most positive project, followed by MaNGOS and then the close values of Diaspora and TrinityCore, followed by Jquery and at last CraftBukkit. Differences between Figs. 1 and 3 are more pronounced. Indeed, the average emotion score is more negative than in the original study for each project. Moreover, while Jquery and CraftBukkit are still the most negative projects, Rails is no longer positive or even least negative.

Fig. 1 Emotion score average per project, using SentiStrength (Guzman et al. 2014)

Fig. 2 Emotion score average per project, using SentiStrength (replication)

Fig. 3 Emotion score average per project, using NLTK (replication)

Next we consider the proportions of negative, neutral and positive sentiments. The original figure from the work of Guzman et al. is shown in Fig. 4, while Figs. 5 and 6 show the results of our replications. The NLTK replication (Fig. 6) shows a larger proportion of negative commit comments than the original paper (Fig. 4), which in turn shows a larger proportion of negative commit comments than the SentiStrength replication (Fig. 5).

Fig. 4 Proportion of positive, neutral and negative commit comments per project, using SentiStrength (Guzman et al. 2014)

Fig. 5 Proportion of positive, neutral and negative commit comments per project, using SentiStrength (replication)

Fig. 6 Proportion of positive, neutral and negative commit comments per project, using NLTK (replication)

Tables 13–15 contain the results of replicating the remaining analyses of the study by Guzman et al. As above, we replicate those analyses twice: first using the same tool used by the authors (SentiStrength), and then using an alternative tool (NLTK).

In contrast to SentiStrength, NLTK outputs scores between 0 and 1 for negative, neutral and positive to indicate the probability of each sentiment. In the original paper, the SentiStrength scores are mapped to an integer in the range [−5,−1) for negative texts, 0 for neutral texts and in the range (1,5] for positive texts. In addition, negative scores were multiplied by 1.5 to account for the less frequent occurrence of negativity in human texts. Therefore, when using NLTK we apply a transformation to create numbers in the same range according to the following formula:

$$\textit{sentiment\_score} = \left\{\begin{array}{ll} (((\textit{neg} - 0.5)\cdot(-6)) - 2)\cdot 1.5 & \text{if negative}\\ 0 & \text{if neutral} \\ ((\textit{pos} - 0.5)\cdot 6) + 2 & \text{if positive} \end{array}\right. $$

The formula maps numbers from the range given by NLTK to the range used by SentiStrength and multiplies negative scores by 1.5, as done in the study by Guzman et al.
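A minimal sketch of this mapping (the helper name is ours; neg and pos denote the NLTK probabilities and label the class chosen as described in Section 2.2.3):

```python
def nltk_to_sentiment_score(label, neg, pos):
    if label == "negative":
        return (((neg - 0.5) * (-6)) - 2) * 1.5  # negative scores additionally weighted by 1.5
    if label == "positive":
        return ((pos - 0.5) * 6) + 2
    return 0.0                                   # neutral
```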

We stress that we do not compare the sentiment values obtained using NLTK with those obtained using SentiStrength. Rather, we compare sentiment values obtained for different groups of comments using the same tool and the same data set, and then observe (dis)agreement between the conclusions made. In Tables 13–15 we replicate the sentiment scores grouped by programming language, weekday and time of the day. The original study reports the mean and the standard deviation. However, the mean can be unreliable (Vasilescu et al. 2011) and, therefore, we also report the median and the interquartile range IQR = Q3 − Q1.

Guzman et al. report that “Java projects tend to have a slightly more negative score than projects implemented in other languages”. As can be seen from Table 13, when the same tool (SentiStrength) has been applied to our data set, a similar conclusion can be made. This is, however, not the case when NLTK has been applied: Table 13 shows a lower average emotion score for the C programming language than for Java. Also the median score for C is lower than for Java. We can therefore say that the validity of this conclusion is not affected by the data set but is affected by the choice of the sentiment analysis tool.

Table 13 Emotion score average grouped by programming language

Furthermore, Guzman et al. report that the observation about Java has been statistically confirmed and that the statistical tests on the remaining programming languages (C, C++, JavaScript, PHP, Python and Ruby) did not yield significant results. The statistical test used is the Wilcoxon rank sum test. The authors compare seven programming languages and report that the corresponding p-values are less than or equal to 0.002. We conjecture that the Bonferroni correction for multiple comparisons has been applied since 0.05/21 ≃ 0.0024.

When replicating this study we first exclude projects developed in languages other than the seven considered in the original study, keeping 55,405 commit comments. Next we compare the distributions corresponding to the different programming languages. A more statistically sound procedure would have been the \(\widetilde {\mathbf {T}}\)-procedure discussed in Section 4.1.4. However, in order to keep our replication as close as possible to the original study, we also perform a series of pairwise Wilcoxon tests with the Bonferroni correction.
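A minimal sketch of such a procedure is given below, assuming the per-language sentiment scores are available in a hypothetical dictionary scores_by_language; we use SciPy's mannwhitneyu, which implements the Wilcoxon rank-sum (Mann–Whitney U) test:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_rank_sum_tests(scores_by_language, alpha=0.05):
    """Pairwise Wilcoxon rank-sum tests with a Bonferroni-corrected threshold.

    `scores_by_language` is assumed to map a language name to the list of
    sentiment scores of its commit comments (hypothetical structure).
    """
    pairs = list(combinations(sorted(scores_by_language), 2))
    threshold = alpha / len(pairs)  # 0.05 / 21 ~ 0.0024 for seven languages
    results = {}
    for lang_a, lang_b in pairs:
        _, p_value = mannwhitneyu(scores_by_language[lang_a],
                                  scores_by_language[lang_b],
                                  alternative='two-sided')
        results[(lang_a, lang_b)] = (p_value, p_value <= threshold)
    return results
```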

In the replication with SentiStrength we observe that (1) the claim that Java has a more negative score than the other languages is not confirmed (the p-value for the (Java, C) pair is 0.6552) and (2) the reported lack of statistically significant differences between the other programming languages is not confirmed either (e.g., the p-value for (C, C++) with the two-sided alternative is 6.9 × 10−12). Similarly, in the replication with NLTK neither of the claims of the original study can be confirmed.

Consider next the study of the sentiments grouped by weekday. Guzman et al. report that comments on Monday were more negative than comments on the other days. Similarly to the study of programming languages, Table 14 suggests that a similar conclusion can be derived if SentiStrength is used, but this is no longer the case for NLTK. In fact, the mean NLTK score for Monday is the least negative. The median values both for SentiStrength and for NLTK are 0 for all days, suggesting that no difference can be found. Guzman et al. then performed a statistical analysis and compared Monday against each of the other days. This analysis “confirmed that commit comments were more negative on Monday than on Sunday, Tuesday, and Wednesday (p-value ≤0.015)”. We replicated this study with SentiStrength and observed that p ≤ 0.015 for Tuesday, Friday and Saturday. We can conclude that, while the exact days have not been confirmed, we can still say that commit comments on Monday are more negative than those on some other days. Unfortunately, even this weaker conclusion cannot be confirmed if NLTK is used: p exceeds 0.015 for all days (in fact, p ≥ 0.72 for all days).
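The corresponding one-sided comparisons can be sketched analogously to the pairwise tests above, again assuming a hypothetical dictionary scores_by_day mapping a weekday name to its sentiment scores:

```python
from scipy.stats import mannwhitneyu

OTHER_DAYS = ['Sunday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

def monday_vs_other_days(scores_by_day):
    """One-sided rank-sum tests: is Monday shifted towards more negative scores?

    `scores_by_day` is assumed to map weekday names to lists of sentiment
    scores (hypothetical structure); alternative='less' tests whether
    Monday's scores tend to be lower, i.e. more negative, than those of
    the compared day.
    """
    monday = scores_by_day['Monday']
    return {day: mannwhitneyu(monday, scores_by_day[day],
                              alternative='less').pvalue
            for day in OTHER_DAYS}
```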

Table 14 Emotion score average grouped by weekday

Finally, Table 15 shows that NLTK evaluates the comments made in the afternoon as slightly more negative than the comments made in the evening, in contrast to SentiStrength, which indicates the afternoon comments as the most positive, or at least the least negative, ones. We could not replicate the original time-of-day results with either SentiStrength or NLTK.

Table 15 Emotion score average grouped by time of the day

5.4 Discussion

When replicating the study of Pletea et al. we confirm the original observation that security comments or discussions are more often negative than non-security ones. We also observe that, when compared with the manually labeled security discussions, both tools produce mixed results. However, we could not find evidence supporting the suggestion that security-related discussions are more emotional.

When trying to replicate the results of Guzman et al. we could not derive the same conclusions when a different tool was used. The only conclusion we could replicate when the same tool was used is that the commit comments on Monday are more negative than those on some other days, which is a weakened form of the original claim. Recently, Islam and Zibran (2016) performed a similar study of the differences between emotions expressed by developers during different times and days of the week. Similarly to Guzman et al., Islam and Zibran studied commit messages and used SentiStrength; as opposed to Guzman et al., they considered the 50 projects with the highest number of commits from the Boa dataset (Dyer et al. 2013) rather than the 2014 MSR mining challenge dataset of 90 GitHub projects (Gousios 2013). In sharp contrast with the work of Guzman et al., no significant differences were found in the developers’ emotions across different times and days of the week.

Our replication studies show that the validity of the conclusions of previously published papers such as the ones by Pletea et al. (2014) and Guzman et al. (2014) should be questioned and ideally reassessed when (or if) a sentiment analysis tool specifically targeting the software engineering domain becomes available.

5.5 Threats to Validity

As with any empirical study, the current replications are subject to threats to validity. Since we have tried to follow the methodology presented in the papers being replicated as closely as possible, we have also inherited some of the threats to validity of those papers, e.g., that the dataset under consideration is not representative of GitHub as a whole. Furthermore, we had to convert the NLTK scores to the [−5,5] scale, and this conversion might have introduced additional threats to validity. Finally, we are aware that the pairwise Wilcoxon tests performed in Section 5.3.2 might not be the preferred approach from the statistical point of view: this is why a more advanced statistical technique has been used in Section 4. However, to support the comparative aspects of the replication in Section 5.3.2, we present the results in exactly the same way as in the original work (Guzman et al. 2014).

6 Related Work

This paper builds on our previous work (Jongeling et al. 2015). The current submission extends it by reporting on a follow-up study (Section 3.3) and the replication of two recent studies (Section 5), as well as presenting a more elaborate discussion of the related work below.

6.1 Sentiment Analysis in Large Text Corpora

As announced in the Manifesto for Agile Software Development (Beck et al. 2001), the centrality of developer interaction in large scale software development has come to be increasingly recognized in recent times (Datta et al. 2012; Schröter et al. 2012). Today, software development is influenced in myriad ways by how developers talk, and what they talk about. With distributed teams developing and maintaining many software systems today (Cataldo and Herbsleb 2008), developer interaction is facilitated by collaborative development environments that capture details of discussion around development activities (Costa et al. 2011). Mining such data offers an interesting opportunity to examine implications of the sentiments reflected in developer comments.

Since its inception, sentiment analysis has become a popular approach towards classifying text documents by the predominant sentiment expressed in them (Pang et al. 2002). As people increasingly express themselves freely in online media such as the microblogging site Twitter, or in product reviews on Web marketplaces such as Amazon, rich corpora of text are available for sentiment analysis. Davidov et al. have suggested a semi-supervised approach for recognizing sarcastic sentences in Twitter and Amazon (Davidov et al. 2010). As sentiments are inherently nuanced, a major challenge in sentiment analysis is to discern the contextual meaning of words. Pak and Paroubek suggest an automated and language-independent method for disambiguating adjectives in Twitter data (Pak and Paroubek 2010), and Agarwal et al. have proposed an approach to correctly identify the polarity of tweets (Agarwal et al. 2011). Mohammad, Kiritchenko, and Zhu report the utility of support vector machine (SVM) based classifiers when analyzing sentiments in tweets (Mohammad et al. 2013). Online question and answer forums such as Yahoo! Answers are also helpful sources of data for sentiment mining (Kucuktunc et al. 2012).

6.2 Sentiment Analysis Application in Software Engineering

The burgeoning field of tools, methodologies, and results around sentiment analysis has also impacted how we examine developer discussion. Goul et al. examine how requirements can be extracted from sentiment analysis of app store reviews (Goul et al. 2012). The authors conclude that while sentiment analysis can facilitate requirements engineering, in some cases algorithmic analysis of reviews can be problematic (Goul et al. 2012). User reviews of a software system in operation can offer insights into the quality of the system. However, given the unstructured nature of review comments, it is often hard to reach a clear understanding of how well a system is functioning. A key challenge comes from “... different sentiment of the same sentence in different environment”. To work around this problem, Leopairote et al. propose a methodology that combines lists of positive and negative sentiment words with rule-based classification (Leopairote et al. 2013). Mailing lists often characterize large, open source software systems, as different stakeholders discuss their expectations as well as disappointments with the system. Analyzing the sentiment of such discussions can be an important step towards a deeper understanding of the corresponding ecosystem. Tourani et al. seek to identify distress or happiness in a development team by analyzing sentiments in Apache mailing lists (Tourani et al. 2014). The study concludes that developer and user mailing lists carry similar sentiments, though differently focused, and that automatic sentiment analysis techniques need to be tuned specifically to the software engineering context (Novielli et al. 2015). The impact of sentiment on issue resolution time, similar to RQ3 discussed in Section 4, has also been considered in the literature (Garcia et al. 2013; Ortu et al. 2015).

As mentioned earlier, developer interaction data captured by collaborative development environments are fertile grounds for analyzing sentiments. There are recent trends around designing emotion aware environments that employ sentiment analysis and other techniques to discern and visualize health of a development team in real time (Vivian et al. 2015). Latest studies have also explored the symbiotic relationship between collaborative software engineering and different kinds of task based emotions (Dewan 2015).

6.3 Sentiment Analysis Tools

As already mentioned in the introduction, the application of sentiment analysis tools to software engineering texts has been studied in a series of recent publications (Garcia et al. 2013; Guzman et al. 2014; Guzman and Bruegge 2013; Novielli et al. 2015; Ortu et al. 2015; Panichella et al. 2015; Pletea et al. 2014; Rousinopoulos et al. 2014).

With the notable exception of the work of Panichella et al. (2015), who trained their own classifier on manually labeled software engineering data, all other works have reused existing sentiment analysis tools. Such reuse introduced a commonly recognized threat to the validity of the results obtained: those tools have been trained on non-software engineering related texts such as movie reviews or product reviews and might misidentify (or fail to identify) the polarity of a sentiment in a software engineering artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).

In our previous work (Jongeling et al. 2015) and in the current submission we perform a series of quantitative analyses aiming to evaluate whether the choice of the sentiment analysis tool can affect the validity of software engineering results. A complementary approach to evaluating the applicability of sentiment analysis tools to software engineering data has been followed by Novielli et al. (2015), who performed a qualitative analysis of Stack Overflow posts and compared the results of SentiStrength with those obtained by manual evaluation.

Beyond the discussion of sentiment analysis tools, observations similar to ours have been made in the past for software metric calculators (Barkmann et al. 2009) and code smell detection tools (Fontana et al. 2011). Similarly to our findings, disagreement between the tools was observed.

6.4 Replications and Negative Results

This paper builds on our previous work (Jongeling et al. 2015). The current submission extends it by reporting on the replication of two recent studies (Section 5). There is an enduring concern about the lack of replication studies in empirical software engineering: “Replication is not supported, industrial cases are rare ... In order to help the discipline mature, we think that more systematic empirical evaluation is needed” (Tonella et al. 2007). The challenges around replication studies in empirical software engineering have been identified by Mende (2010). de Magalhães et al. analyzed 36 papers reporting empirical and non-empirical studies related to replications in software engineering and concluded that not only do we need to replicate more studies in software engineering, but also an expansion of “specific conceptual underpinnings, definitions, and process considering the particularities” is needed (de Magalhães et al. 2014). Recent studies have begun to address this replication gap (Sfetsos et al. 2012; Greiler et al. 2015).

One of the most important benefits of replication studies centers on the possibility of arriving at negative results. Although negative results have been widely reported and regarded in different fields of computing for many years (Pritchard 1984; Fuhr and Muller 1987), their importance has been reiterated in recent years (Giraud-Carrier and Dunham 2011). By carefully and objectively examining what went wrong in the quest for an expected outcome, the state of the art and practice can be enhanced (Lindsey 2011; Täht 2014). We believe the results reported in this paper can aid such enhancement.

7 Conclusions

In this paper we have studied the impact of the choice of a sentiment analysis tool when conducting software engineering studies. We have observed that the tools considered not only disagree with the manual labeling, but also with each other, that this disagreement can lead to diverging conclusions, and that previously published results cannot be replicated when different sentiment analysis tools are used.

Our results suggest a need for sentiment analysis tools specifically targeting the software engineering domain. Moreover, going beyond the specifics of the sentiment analysis domain, we would like to encourage researchers to reuse ideas rather than tools.