Knowledge-Based Systems

A forum or social media post can express multiple emotions, such as love, joy or anger. Emotion classification has been proven useful for measuring aspects such as user satisfaction. Despite its usefulness, research in emotion classification is limited, because the task is multi-label and publicly available data sets and lexica are scarce. A number of emotion classifiers for general-domain text have been proposed recently, but only a few for text in the domain of Open Source Software (OSS), such as EmoTxt. In this paper, we explore different lexica and two multi-label algorithms for classifying emotions in text related to OSS. We trained various multi-label classifiers using HOMER and RAkEL on a data set of Stack Overflow posts and a data set of JIRA Issue Tracker comments. The classifiers were enriched with features derived from different state-of-the-art lexica. We achieved multi-label Micro F-scores of up to 0.811 and a Subset 0/1 Loss of 0.290. These results represent a statistically significant improvement over the state-of-the-art. © 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license.


Introduction
Emotions, such as love, joy and anger, are complex states of mind caused by internal or external events [1]. For many years, they have attracted research interest in psychology [2]. Researchers in Natural Language Processing (NLP) have shown interest in their applications. One of them, emotion classification, is explored in this paper for the domain of Open Source Software (OSS).
In psychology, multiple theories have been proposed to understand emotions. In 1972, the first formal theory proposed six basic universal emotions: anger, disgust, fear, joy, sadness and surprise [3]. In 1980, the Wheel of Emotions model arranged eight primary bipolar emotions in four axes: joy vs. sadness, anger vs. fear, trust vs. disgust and surprise vs. anticipation [4]. Secondary emotions are identified as intensity variants or combinations of the primary ones. In 1987, emotions were represented in a tree with six main branches: anger, fear, joy, love, sadness and surprise [5]. The branches can be bifurcated into secondary and tertiary branches to model sub-emotions. In 2012, the Hourglass of Emotions (HOE) [1], a theory based on the Wheel of Emotions, was proposed. Emotions are modelled in four dimensions: pleasantness, attention, sensitivity and aptitude. Each dimension can have positive or negative polarity and a different activation level; depending on how active a dimension is, different emotions are represented.
In Computer Science and NLP, automatic emotion mining has attracted research attention. In particular, the following tasks have been popular [2]:
A. Emotion detection: determining whether a text conveys any emotion(s), without specifying which one(s).
B. Emotion classification: identifying the particular emotions, such as love or sadness, triggered in a non-neutral text.
C. Emotion cause detection: determining the causes that stimulated the emotions expressed in a text.
Emotion classification, the most popular among these tasks [2], has been applied to quantify customer satisfaction with products or services [6], to prevent suicide [7] and to analyse newspaper articles [8] or tweets [9]. The classification of emotions in text related to OSS is less explored, as in [10,11].
In general, emotion classification is less popular than sentiment analysis, i.e. the classification of text as positive, negative or neutral. Annotated data sets are few and small, due to their cost, subjectivity and exposure to disagreements [12]. In addition, there are multiple theories about emotions, as discussed previously. Resources for emotion classification, such as lexica, are scarce, probably because annotation is hard and expensive, and also due to the multitude of emotion theories. Finally, emotion classification is a multi-label problem, i.e. more than one emotion can be expressed in a piece of text, which makes it more challenging than single-label tasks.
This paper presents our experiments towards building an emotion classifier for OSS-related text; the source code and data sets are publicly available. We used two multi-label classification algorithms and various emotion lexica. Methods were evaluated on two different data sets, one consisting of JIRA Issue Tracker comments and one of Stack Overflow posts. Moreover, the performance of the classifiers was compared against a random baseline, a most frequent label baseline and EmoTxt [11], a state-of-the-art tool. Experimental results show that the classification methods explored outperform the baselines. They also perform statistically better than EmoTxt when trained and tested on JIRA comments. We did not observe a statistical difference between EmoTxt and classifiers trained and tested on Stack Overflow posts.
This paper is structured as follows: Sections 2 and 3 discuss motivation for this work and multi-label classifiers, respectively. Section 4 presents work related to automatic emotion classification. Our methodology is explained in Section 5, whereas data is discussed in Section 6. In Section 7, we discuss the experimental and evaluative settings. Results are presented and discussed in Sections 8 and 9, respectively. Some threats to the validity of our conclusions are analysed in Section 10. Section 11 concludes and proposes some future work.

Motivation
The motivation for this work is to create an emotion classifier for text related to OSS projects, i.e. forum posts, issue tracker messages and mailing lists, which in combination with other tools can help developers to analyse OSS projects in terms of elements such as user experience or project management quality. As will be shown in Section 4, most existing tools for emotion classification have been created for processing general-domain texts. However, words in text related to OSS projects can have different connotations or senses. For example, some general-domain classifiers indicate that the phrase ''Every time I call this method, Java eats my RAM'' expresses emotions other than anger or sadness. Therefore, it is essential to create an emotion classifier that knows how specific words are used in the OSS domain.
Specifically, this research is part of the CROSSMINER project [13], which aims to help OSS developers create complex software systems by enabling monitoring and analysis of OSS projects and by assisting developers in selecting components, such as libraries, while facilitating knowledge extraction from their repositories.

Multi-label classifiers
Multi-label classification is the process of classifying data instances, each of which may belong to one or more classes [14]. It contrasts with single-label classification, in which an instance can only be categorised into one class. Due to the varying number of labels that have to be predicted for an instance, multi-label classification problems are considered harder than single-label ones [15]. Apart from emotion classification, other multi-label classification tasks concern genres of films or books, elements in images or music styles in songs. Multi-label classifiers can be divided into three groups, following the approach they use [8]:
A. Problem transformation methods transform a multi-label task into multiple single-label ones. There are two types of transformation: Binary Relevance and Label Powerset. In the former, each instance annotated with several labels is copied multiple times, each copy being annotated with one label. Label Powerset assigns a unique label to each multi-label combination; usually, only combinations present in the training data are considered.
B. Ensemble algorithms improve upon problem transformation methods by using multiple classifiers, trained on subsets of the original training data. Examples of these methods are HOMER [16,17], RAkEL [18] and ECC [19].
C. Algorithm adaptation methods extend learning algorithms to handle multi-label data directly; some of them [23] use, at their core, transformations to solve the problem [24].
Moreover, in recent years the number of multi-label algorithms based on neural networks has increased. Examples in the domain of image processing are Wei et al. [25] and Wang et al. [26], and in text classification we can name FastText [27] and Nam et al. [28]. Furthermore, there is recent research interest in what is known as Extreme Multi-label Classification, where thousands, even millions, of possible labels have to be processed [29-32].
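As a minimal sketch of the two transformation types, using a toy label set and toy instances (not the paper's data), Binary Relevance is shown here in its common one-binary-data-set-per-label form:

```python
# Toy label set and instances; illustrative only, not the paper's data.
EMOTIONS = ["anger", "joy", "love", "sadness"]

dataset = [
    ("thanks, this works great", {"joy", "love"}),
    ("this bug again", {"anger"}),
]

def binary_relevance(data, labels):
    """Build one binary data set per label: each instance is marked
    1/0 according to whether that label is active for it."""
    return {label: [(text, int(label in active)) for text, active in data]
            for label in labels}

def label_powerset(data):
    """Build one multi-class data set: each distinct combination of
    labels observed in the data becomes a single class."""
    return [(text, tuple(sorted(active))) for text, active in data]
```

Under Label Powerset, the first instance receives the single class ("joy", "love"), which is why only combinations seen in the training data can be predicted.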
In this work, we experimented with two methods: HOMER and RAkEL. Specifically, we used the implementations in Mulan [33], a multi-label extension of the machine learning library Weka [34].
The Hierarchy Of Multilabel classifiERs (HOMER) [16] is a machine learning algorithm that ''addresses a multi-label task by breaking down the entire label set recursively into several disjoint smaller sets that contain similar labels'' [17]. These smaller label sets are then used to train multiple multi-label classifiers, arranged hierarchically, each focusing on a smaller sub-classification task. In its default instantiation, HOMER transforms the labels using Binary Relevance and internally uses C4.5 Decision Trees as the main classification algorithm.
RAndom k-labELsets (RAkEL) [18] is another machine learning algorithm that creates multiple multi-label classifiers. However, unlike HOMER, RAkEL splits the set of labels into disjoint subsets that are selected randomly and non-recursively. By default, RAkEL uses a Label Powerset transformation in an attempt to improve predictions by finding correlations between labels. Its default internal classification algorithm is also a C4.5 Decision Tree.
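The label-subset sampling at the heart of RAkEL's disjoint variant can be sketched as follows; this is an illustrative simplification, not the Mulan implementation used in this work:

```python
import random

def random_k_labelsets(labels, k, seed=0):
    """Split the label set into disjoint, randomly drawn subsets of at
    most k labels each, in the spirit of RAkEL's disjoint variant
    (non-recursive, unlike HOMER's recursive splitting)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]
```

Each resulting subset would then be handled by one internal Label Powerset classifier.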
Both HOMER and RAkEL can support any multi-label classification algorithm, such as BP-MLL or CLUS, or single-label classifiers with transformed labels, either with Binary Relevance or Label Powerset.

Related work
Recently, researchers have shown interest in automatic classification of emotions in text. In this section, we review the state-of-the-art.
The Affect Analysis Model [35] is an unsupervised emotion classification method. It relies on rules and a manually annotated database. Among others, it contains affective strength for emoticons, affect words, common abbreviations and acronyms. Feeler [36] is also unsupervised and based on the cosine similarity of high-dimensional vectors. Its features encode TF-IDF-weighted unigrams and are enriched using lexica, such as the WordNet Affect Lexicon [37]. The unsupervised method presented in [38] uses this last lexicon too and reduction tools, such as Latent Semantic Analysis and Non-negative Matrix Factorisation.
In contrast to these methods, supervised learning has been combined with a psychological approach in [39]. In particular, a Hidden Markov Model (HMM) was used to simulate how mental state sequences affect or cause emotions.
A multi-label classifier was employed to detect emotions in suicide notes [40]. It used Label Powerset and a one-vs.-all radial basis Support Vector Machine (SVM) that represented text using unigrams. The classifier detected 15 emotions, e.g. hopelessness and guilt, and also the lack of emotion. Many multi-label classifiers, e.g. BP-MLL, RAkEL and HOMER, have been evaluated for emotion identification in short Brazilian Portuguese texts [8]. Words that did not occur in a stoplist were weighted by TF-IDF and the SenticNet lexicon [41] was used.
Several state-of-the-art systems were proposed to address tasks in the Semantic Evaluation series (SemEval). Task 4 in SemEval 2007 was about classifying emotions and polarity of news headlines [42]. Out of three participants, the best-performing system in the emotion classification sub-task was UPAR7 [43], a rule-based system that uses dependency graphs enriched with information from the WordNet Affect Lexicon and SentiWordNet [44]. Task 1 in SemEval 2018 [9] also focussed, among others, on emotion classification. Most participants used Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) architectures, along with external resources such as lexica, word embeddings or word n-grams. The best classifier, NTUA-SLP [45], consists of a Bidirectional LSTM (BiLSTM) that uses attention and embeddings, trained on a large corpus of unlabelled tweets. As the task only provided a small training set, transfer learning was implemented by pretraining on the sentiment analysis corpus of Task 4 in SemEval 2017 [46].
Recently, there has been research interest in the classification of emotions in texts related to software engineering and development. For instance, Murgia et al. [47] performed a qualitative and quantitative analysis of the feasibility of applying automatic classification techniques, such as Naïve Bayes, SVM or k-Nearest Neighbours, to classifying emotions in issue comments.
In [10], emotions, such as anger, love, sadness and joy, conveyed in posts from the JIRA Issue Tracker were detected by multiple Linear SVMs; the data set contains 4000 entries and is not described in detail, but it is most probably the one presented in [48]. It is not indicated whether or how the multi-label task was addressed. Apart from text, features encode information from the WordNet Affect Lexicon, SentiStrength [49] and a politeness detection tool [50].
EmoTxt [11], freely available at github.com/collab-uniba/Emotion_and_Polarity_SO, is an emotion classifier based on the principles in [10], separately trained on two corpora: JIRA Issue Tracker posts [48] and Stack Overflow comments [51]. EmoTxt uses six binary SVMs for classifying joy, love, sadness, anger, surprise and fear, following a one-vs.-all approach with Binary Relevance; models for surprise and fear were not generated on JIRA posts. Each SVM can only assign one specific emotion. Apart from the features in [10], the authors added TF-IDF. Although EmoTxt can be seen as multi-label if the outputs of all classifiers are merged, it was evaluated using uniquely single-label metrics, i.e. precision, recall and F-score. EmoTxt has been implemented in EMTk [52] and in EmoD [53], two toolkits that analyse emotions and sentiment in software engineering documents and data sources related to repositories.
DEVA [54] is based on a bi-dimensional theory of emotions, in which excitement, stress, depression and relaxation are determined according to arousal and valence values. To determine them, it uses dictionaries along with heuristics, such as the detection of exclamation marks, capital letters or interjections. DEVA was evaluated on a manually annotated corpus of ∼1800 JIRA comments. An improved version of DEVA [55] uses machine learning, e.g. Adaptive Boosting and Gradient Boosting Trees, instead of only lexicons and heuristics.
As discussed, most state-of-the-art emotion classifiers are unsupervised. Some explore supervised machine learning and mainly rely on single-label classifiers.

Methodology
We apply two multi-label classifiers, HOMER and RAkEL, and evaluate them comparatively on OSS-related text. We used the NLP4J lemmatiser to explore whether lemmatisation affects the outcome. We also investigate the use of lexical resources to enrich classification vectors. The vectors are composed mainly of word n-grams, skip-bigrams, and extra text-based and lexicon-based heuristic features. The extra text-based features count elements of the text, and additional binary features represent the presence of meaningful symbols next to the first and last token of a text.
Features were extracted from three lexica: SenticNet 5 [41], the NRC Word-Emotion Association Lexicon [56] and the NRC Affect Intensity Lexicon [57]. The lexica were organised in two groups, and each contributed different features:
A. NRC Emotion: this group consists of the NRC Word-Emotion Association Lexicon and the NRC Affect Intensity Lexicon. The former has been annotated by crowd-sourcing for emotions (anger, fear, anticipation, trust, surprise, sadness, joy and disgust) and polarity (positive or negative) associated with a list of words that come from other lexica. From this lexicon we obtain six features: the number of words related to five specific emotions (anger, fear, surprise, sadness and joy) and the number of neutral words, i.e. words that appear in the lexicon but are not linked to any polarity or emotion. The latter lexicon is a manually annotated collection of ∼6k words linked to their intensities for each of four emotions (anger, joy, sadness and fear). For each emotion in this lexicon, we calculate the number of related words in an instance, the average and maximum emotional strength and the strength of the last emotional word. This group contributed a total of 22 features.
B. SenticNet 5 is a collection of 100,000 entries, ranging from unigrams to five-grams, annotated according to the axes of the Hourglass of Emotions.
SenticNet 5 also contains polarity annotations (positive or negative), polarity strength, moods, i.e. surprise, interest and disgust, and related concepts. It was generated using an LSTM neural network that extended previous SenticNet versions by discovering conceptual primitives, i.e. ensembles of verb-noun pairs. We calculate 27 features by matching word n-grams between the text and the lexicon; the length of the n-grams used in this matching, one to five, is independent of the length of the n-grams used by the machine learning algorithms. Eight features concern polarity: the number of n-grams with polarity in a text, the average and maximum polarity strength and the strength of the last word with polarity. The remaining 19 features concern the axes of the Hourglass of Emotions: the number of n-grams in an instance related to each axis, their average, maximum and minimum strength and the strength of the last word related to the axis.
Term search and matching was done on lemmatised texts, regardless of whether the classifiers used lemmatisation to extract features. This was done to maximise the number of matching words and to generate numerical features accurately.
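As an illustration of the per-emotion feature template described above (count, average strength, maximum strength and strength of the last emotional word), the sketch below uses a toy intensity lexicon; the entries and values are hypothetical, not taken from the NRC Affect Intensity Lexicon:

```python
# Hypothetical lexicon entries mapping a word to (emotion, strength);
# the real features are derived from the NRC Affect Intensity Lexicon.
INTENSITY_LEXICON = {
    "furious": ("anger", 0.96),
    "annoyed": ("anger", 0.60),
    "cheerful": ("joy", 0.72),
}

def intensity_features(tokens, emotion, lexicon=INTENSITY_LEXICON):
    """Per-emotion feature template: number of related words, their
    average and maximum strength, and the strength of the last
    emotional word in the instance."""
    strengths = []
    for token in tokens:
        entry = lexicon.get(token)
        if entry is not None and entry[0] == emotion:
            strengths.append(entry[1])
    if not strengths:
        return {"count": 0, "avg": 0.0, "max": 0.0, "last": 0.0}
    return {"count": len(strengths),
            "avg": sum(strengths) / len(strengths),
            "max": max(strengths),
            "last": strengths[-1]}
```

In the actual pipeline these values would be computed over the lemmatised tokens, per the matching strategy above.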

Data sets
To the best of our knowledge, there are two data sets for emotion detection in the domain of OSS. The first is a collection of ∼5.8k JIRA Issue Tracker comments divided into three groups that correspond to varying levels of granularity, i.e. sentences vs. full comments [48]. The annotated emotions are joy, love, anger, sadness, surprise and fear; neutral instances are also included. This data was used for EmoTxt [11] and for a sentiment analysis tool [58]. The second data set contains 4.8k Stack Overflow posts [51], manually annotated with the same emotions. All instances are assigned two labels or fewer; Table 1 shows the number of instances annotated with each label combination, for each corpus.
Following EmoTxt [11], we aim at determining which emotions are present in a text, and we did not consider the neutral instances. The non-neutral instances of both corpora were split into training and test parts using stratified sampling with an 80%-20% proportion. In the split process, we prioritised multi-label instances towards the training part. For instance, in JIRA, there is only one instance labelled with both anger and surprise; therefore, it was assigned to the training part.

Settings
We conducted 16 experiments, in which we used different multi-label classifiers, lexica for vector enrichment, lemmatisation settings and data sets for training and testing. Furthermore, we optimised parameters using Bayesian Optimisation [59], a lazy learning method that models the hyper-surface generated by an objective function and a parameter set. We optimised the following parameters:
A. Word n-grams: contiguous sequences of words linked by white-space, e.g. ''I am happy''. We considered n-grams of length one, two and three.
B. Skip-bigrams: a variation of n-grams in which a gap of predefined size is skipped to generate bigrams. For example, in the phrase ''... a frequent coding issue'', ''frequent issue'' is a skip-bigram with a one-token gap. We considered skip-bigrams with a gap of one or two tokens, and also no skip-bigrams at all.
C. Minimum frequency of occurrence: we explored a threshold in the range [1, 50]. N-grams or skip-bigrams whose frequency exceeds this threshold are considered as features.
D. Subsets: as discussed in Section 3, HOMER and RAkEL subdivide the training data set and create multiple single-label classifiers that deal with smaller label sets. We optimised the number of subdivisions used by each method, which ranged between two and five, i.e. the number of labels minus one.
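The word n-gram and skip-bigram features in items A and B can be sketched as follows (a minimal illustration of the feature extraction, not the actual pipeline code):

```python
def word_ngrams(tokens, n):
    """Contiguous word n-grams, e.g. n=2 over "I am happy" yields
    "I am" and "am happy"."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, gap):
    """Bigrams whose two words are separated by `gap` skipped tokens,
    e.g. "frequent issue" from "a frequent coding issue" with gap=1."""
    return [tokens[i] + " " + tokens[i + gap + 1]
            for i in range(len(tokens) - gap - 1)]
```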
As the Bayesian Optimisation objective function, we used the minimum between the median and the average of the multi-label Macro F-score (see Eq. (3)), calculated in a 10-fold cross-validation setting. We chose the Macro F-score as it considers the proportion of labels in the data set. Table 2 shows the parameters obtained by Bayesian Optimisation for all experiments.
Like most multi-label classifiers, HOMER and RAkEL are probabilistic. Thus, for an emotion label to be considered as triggered, its probability needs to surpass a threshold. We used the default threshold set in HOMER and RAkEL, 0.5.
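The thresholding step can be sketched as follows; this is a minimal illustration, as the actual decision is made inside the Mulan implementations:

```python
def triggered_labels(probabilities, threshold=0.5):
    """Binary prediction vector: a label is triggered when its
    probability surpasses the threshold (0.5 by default, as in the
    HOMER and RAkEL implementations used here)."""
    return [int(p > threshold) for p in probabilities]
```

When no probability surpasses the threshold, the result is an all-zero (''null'') vector, a phenomenon examined in the results.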
To extend evaluation, we compared HOMER and RAkEL against EmoTxt, which we trained on the JIRA and Stack Overflow data sets for love, joy, sadness, anger, surprise and fear. Its parameters were tuned using the integrated optimisation facility. For each instance, the predictions of all EmoTxt models were merged into a single vector, compatible with multi-label evaluation metrics.
We considered two baselines: assigning random labels and assigning the most frequent label, i.e. love, to all test instances. The random baseline consists of six boolean random generators that independently determine which labels are activated in a prediction vector. The scores are averaged over ten executions.
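The random baseline can be sketched as follows; this is an illustrative re-implementation, not the exact code used in the experiments:

```python
import random

def random_baseline_predictions(num_instances, num_labels=6, seed=None):
    """One independent boolean generator per label: each of the six
    emotion labels is activated with probability 0.5 for every
    instance, producing a random prediction vector."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(num_labels)]
            for _ in range(num_instances)]
```

In the experiments, this generation is repeated ten times and the evaluation scores are averaged.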
Multi-label classification is challenging, not only in developing methods but also evaluating results because ''it is difficult to tell which of the following mistakes is more serious: one instance with three incorrect labels vs. three instances each with one incorrect label'' [60].
Unlike single-label metrics, multi-label ones use vector sets instead of confusion matrices to represent all instances. Let us consider a corpus of n manually annotated instances. Let the vector sets T = {t_1, ..., t_n} and P = {p_1, ..., p_n} represent the ground truth and predictions, respectively. Each t_i or p_i, i ∈ [1..n], is a binary vector of length l, equal to the number of possible labels; 1 denotes triggered labels and 0 inactive ones. Below, we discuss the metrics we employ.
Hamming Loss calculates how different the prediction is from the expected outcome. For each incorrect label prediction, 1 is added to the loss function:

HammingLoss(T, P) = (1/(n·l)) Σ_{i=1}^{n} Σ_{j=1}^{l} [[t_{i,j} ≠ p_{i,j}]]  (1)

where [[·]] equals 1 when the condition holds and 0 otherwise. For this metric, all errors are equally important: predicting anger instead of love and predicting joy instead of love are equally wrong. Hamming Loss is affected by corpus imbalance, i.e. the wrong prediction of infrequent labels can be under-estimated.
Subset 0/1 Loss counts predictions with at least one incorrect label as wrong:

Subset0/1Loss(T, P) = (1/n) Σ_{i=1}^{n} [[t_i ≠ p_i]]  (2)

Two observations are in order. First, using a metric that does not consider the proportion of labels can be misleading; for example, a classifier that assigns the most frequent label to all instances can generate low values of Hamming Loss, because it hides classification errors for infrequent class instances. Second, while the random baseline can generate null vectors, in which none of the emotions is triggered, a multi-label classifier theoretically cannot; in practice, null vectors are possible because probabilities are generated for each label independently and, unlike in single-label classification, normalisation functions, such as softmax, are not applicable.
Macro F-score evaluates label prediction accuracy. It takes into account the proportion of each label class in the data set:

MacroF(T, P) = (1/l) Σ_{j=1}^{l} (2 Σ_{i=1}^{n} t_{i,j}·p_{i,j}) / (Σ_{i=1}^{n} t_{i,j} + Σ_{i=1}^{n} p_{i,j})  (3)

Micro F-score evaluates how well, on average, labels and instances have been predicted:

MicroF(T, P) = (2 Σ_{i=1}^{n} Σ_{j=1}^{l} t_{i,j}·p_{i,j}) / (Σ_{i=1}^{n} Σ_{j=1}^{l} t_{i,j} + Σ_{i=1}^{n} Σ_{j=1}^{l} p_{i,j})  (4)

Instance F-score assesses how well, on average, instances have been predicted:

InstanceF(T, P) = (1/n) Σ_{i=1}^{n} (2 Σ_{j=1}^{l} t_{i,j}·p_{i,j}) / (Σ_{j=1}^{l} t_{i,j} + Σ_{j=1}^{l} p_{i,j})  (5)

The range of Hamming and Subset 0/1 Loss values is [0, 1].
As they are loss functions, zero means that all predictions were correct. Macro, Micro and Instance F-score also range in [0, 1]; however, one indicates perfect performance. Although multi-label metrics are suitable for this task, we also use standard single-label metrics to evaluate the global performance of each classifier per emotion. Let I_c(E) be the number of instances where emotion E was predicted correctly, I_p(E) the number of instances predicted with E and I_a(E) the actual number of instances with E. Precision and recall are defined as P(E) = I_c(E) / I_p(E) and R(E) = I_c(E) / I_a(E); the F-score is their harmonic mean. Table 3 shows the evaluation results for each classifier and the number of ''null'' vectors predicted; in null vectors, none of the emotions was predicted with a probability higher than the 0.5 threshold. Results for EmoTxt and the two baselines, discussed in Section 7, are also shown.
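The multi-label metrics above can be re-implemented in a few lines; this is a plain illustrative sketch, not the evaluation code used in the experiments:

```python
def hamming_loss(T, P):
    """Fraction of individual label decisions that are wrong,
    over all n instances and l labels."""
    n, l = len(T), len(T[0])
    wrong = sum(t != p for tv, pv in zip(T, P) for t, p in zip(tv, pv))
    return wrong / (n * l)

def subset_01_loss(T, P):
    """Fraction of instances with at least one wrongly predicted label."""
    return sum(tv != pv for tv, pv in zip(T, P)) / len(T)

def micro_f_score(T, P):
    """F-score computed over all label decisions pooled together."""
    pairs = [(t, p) for tv, pv in zip(T, P) for t, p in zip(tv, pv)]
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 1.0
```

Here T and P are lists of binary vectors, matching the ground-truth and prediction representation defined earlier.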

Results
We observe that models trained and tested on JIRA perform better than models trained and tested on Stack Overflow. The performance is lower for models trained and tested on different data sets. The number of null vectors fluctuates remarkably, but most are predicted by models trained on Stack Overflow. SenticNet 5 produces the fewest null vectors when combined with HOMER.
Subset 0/1 Loss expresses the percentage of wrongly predicted instances. We can observe that the model with ID = 13 predicted at least one emotion wrongly in 29% of the JIRA test instances, whereas the model with ID = 15 predicted wrongly 44.2% of the Stack Overflow test instances. The random baseline predicted wrongly 98% of the instances in both test data sets, and the Most Frequent Label baseline predicted wrongly at least 60% of the instances. In general, EmoTxt performs worse than the HOMER and RAkEL methods when trained and tested on the same source. Otherwise, performance differences are not constant and no particular pattern can be observed.

Table 3: Evaluation results of the 16 emotion classifiers that were considered. The scores presented for the random label baseline are the average of ten executions.
All models outperform the random baseline with respect to all metrics. Models trained and tested on the same source perform twice as well as the baseline in terms of Macro F-score, three times better for Micro F-score and Subset 0/1 Loss, and five times better for Hamming Loss. Similarly, these models perform better than the most frequent label baseline. The baselines generate fewer null vectors; however, this does not mean that the vectors they predict are always correct.
With respect to the statistical analysis of Subset 0/1 results for methods tested on JIRA, Cochran's Q Test showed that at least one method pair exhibits a statistically significant performance difference.

Table 4: Results of applying a post hoc test over the outcomes obtained for methods tested on the Stack Overflow data set (top-right part) and on the JIRA data set (bottom-left part). A dash (-) designates no statistical difference found; * designates p < 0.05, • p < 0.01 and • p < 0.001. Cramér's V effect size is shown next to the p value.

Table 5: Precision (P), Recall (R) and F-score (F1), for every emotion, obtained by each classifier trained and tested on one specific corpus, either the JIRA data set or the Stack Overflow data set. SN stands for SenticNet.

Post hoc tests did not find statistical differences between using lemmatisation and not, at least when models are tested on the same data set. In those cases where there is a statistical difference, the effect sizes were small; effect sizes smaller than 0.10 mean that differences are trivial in practice or difficult to notice without further analysis. For instance, methods ID = 6 and ID = 5, tested on Stack Overflow, are statistically different (p = 4.87 × 10^-2) but the effect size is only 0.09. Table 5 presents evaluation results, in terms of precision, recall and F-score, of models trained and tested on one specific corpus, either JIRA or Stack Overflow. In the JIRA part of Table 5, we observe that all models perform better than EmoTxt, especially in terms of F-score. Most methods, including EmoTxt, have issues with predicting fear, except for HOMER using no lemmatisation and vectors enriched with SenticNet 5. EmoTxt has issues in predicting love and surprise, whereas for sadness and anger it is the most precise. Furthermore, based on the number of null vectors presented in Table 3, EmoTxt is a conservative tool in general, which overall affects its recall.
In the Stack Overflow part of Table 5, EmoTxt is the most precise method for almost all emotions. However, EmoTxt achieves low recall, especially for joy or fear. Surprise is hard to predict for all methods, except for HOMER, which can predict some instances. The F-scores can explain why the Macro F-score values in Table 3 are low. Most methods fail to predict correctly at least one emotion, affecting Macro F-scores severely.

Discussion
A poor vocabulary intersection between the JIRA and Stack Overflow data may be one reason why models trained on JIRA did not perform well when tested on Stack Overflow (see Table 3). We consider this unlikely, because the two data sets come from the same domain, software engineering. Another possibility is that people express themselves differently on the two platforms, even within the same domain. For example, on Stack Overflow, people may be more straightforward and less emotionally expressive than on JIRA, where discussions can easily get longer.
A further reason could be that annotators perceive emotions differently. For the Stack Overflow data set, inter-annotator agreement is moderate: the Fleiss' Kappa score ranges between 0.30 and 0.66 for the different emotions, with an average of 0.47 [62]. This indicates that the annotation task was not easy, and neither is emotion classification.
Comparing the Instance F-scores and Subset 0/1 Loss values in Table 3, we can determine how precisely classifiers dealt with the emotions, and their multi-label aspect. For example, in Stack Overflow, model ID = 16 predicts emotions more accurately than model ID = 15. However, the latter predicts more instances correctly based on Subset 0/1 Loss values. This indicates that ID = 16 manages the multi-label aspect better, but is less precise than ID = 15 in detecting emotions. Model ID = 13 is best for predicting emotions in JIRA, because it achieved the highest Instance F-score and the lowest Subset 0/1 Loss.
All models fail to predict fear and surprise correctly because they are the least frequent labels. The Stack Overflow data set contains more of these instances than JIRA but, evidently, not enough. To address this issue, we could use a classification algorithm designed to deal with class imbalance instead of HOMER's and RAkEL's default algorithm, a C4.5 Decision Tree. For instance, DECOC (Diversified Error Correction Output Codes) [63] follows the ideas of Error-Correcting Output Codes (ECOC) [64], i.e. it uses multiple combinations of binary classifiers that are merged before the final output, but it uses different weights in order to prioritise minority classes. This weighted approach has been shown to perform better than other similar imbalance classification algorithms [63].
Another possible solution to this problem is to use an oversampling algorithm, such as SMOTE (Synthetic Minority Over-sampling Technique) [65]. However, rather than applying it to the whole training data set before passing it to the classifiers, we could embed it into HOMER and RAkEL. Specifically, HOMER and RAkEL generate subsets of labels internally; thus, we could apply SMOTE to oversample the less frequent labels within these label subsets and, in consequence, improve the performance of the classifiers. This approach would be similar to the one proposed in [66], where SMOTE is embedded into an AdaBoost SVM algorithm to improve the classification of imbalanced classes.
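The core interpolation idea of SMOTE can be sketched as follows; the function name and parameters are illustrative, and real implementations, such as imbalanced-learn's, are considerably more complete:

```python
import random

def smote_synthetic(minority, k=2, seed=0):
    """Generate one synthetic instance by interpolating between a
    random minority instance and one of its k nearest neighbours;
    a minimal sketch of SMOTE's core idea."""
    rng = random.Random(seed)
    base = rng.choice(minority)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # k nearest neighbours of the chosen base point (excluding itself).
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: sq_dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    alpha = rng.random()  # interpolation factor in [0, 1)
    return [x + alpha * (y - x) for x, y in zip(base, neighbour)]
```

Embedded into HOMER or RAkEL, such synthetic instances would be generated only for the infrequent labels within each internal label subset.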
To interpret the results of the proposed models and EmoTxt, we manually analysed some incorrectly predicted instances, shown in Table 6. In example A, some classifiers predicted all emotions correctly, while others predicted only some or none. In example B, most classifiers predicted love instead of joy, probably because love was assigned to most JIRA instances that contain the word ''thanks'', e.g. ''Wow, fast. Thanks!'' or ''Thanks, Ashish!''.
In example C, only models trained on Stack Overflow (ID = 18, 11, 12) assigned the correct emotion. We believe this text may be sarcastic, expressing sadness or anger. In examples D and E, opposite emotions were predicted, e.g. love vs. anger. Due to the short length of these two examples, it is hard to determine which words or elements triggered the emotions. In example F, some classifiers assigned more than one emotion, sometimes other than the actually annotated one. Depending on the message context, the wrongly predicted emotions may still seem relevant: if the context is criticism of Windows and praise of Unix, F may express sadness or anger in addition to surprise; if F relates to a severe Linux bug, it may be sarcastic and, in consequence, express only surprise.
In example G, most classifiers correctly predicted anger, but examples I and J show that classifiers cannot distinguish love from joy well. Example K is incomplete, so part of the context is lost. We suspect that the missing text was in code tags, which are frequent in Stack Overflow, and was removed; text in code tags should not represent natural language. Finally, example L shows how positive words, e.g. ''hope'', can mislead emotion prediction. Related to this last point, it would make sense to integrate a negation management tool, similar to those used in sentiment analysers, to prevent predicting opposite emotions, e.g. fear instead of joy, as in example L in Table 6. The integration may be complex, as it is hard to compute which emotions are negated. For instance, the phrase ''I'm not afraid of what will happen to GitHub after being purchased by Microsoft.'' is hard to annotate for emotions, even for humans. For lexicon-based classifiers, ''afraid'' may trigger fear; however, due to the negation, it should be handled differently.
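A minimal sketch of such a negation management step is to suppress a lexicon-triggered emotion when a negation cue precedes it within a small window; the mini-lexicon, cue list and window size below are assumptions for illustration only, not entries from an actual resource:

```python
# Hypothetical mini-lexicon mapping words to emotions; a real system
# would use a resource such as the NRC Word-Emotion Association Lexicon.
LEXICON = {"afraid": "fear", "love": "love", "hope": "joy"}
# "'t" captures contractions such as "don't" after the apostrophe split below.
NEGATIONS = {"not", "no", "never", "'t"}
WINDOW = 3  # how many preceding words a negation cue scopes over (assumption)

def detect_emotions(text):
    """Return lexicon-triggered emotions, dropping those in negation scope."""
    tokens = text.lower().replace("'", " '").split()
    emotions = set()
    for i, tok in enumerate(tokens):
        emo = LEXICON.get(tok.strip(".,!?"))
        if emo is None:
            continue
        # Suppress the emotion if a negation cue appears shortly before it.
        if not any(t in NEGATIONS for t in tokens[max(0, i - WINDOW):i]):
            emotions.add(emo)
    return emotions
```

On the GitHub example above, this sketch would suppress fear because ''not'' falls within the window before ''afraid''; a production tool would need proper negation-scope detection rather than a fixed window.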
Reverting elongated words to their original form can be complicated. It requires detecting which words are truly elongated, e.g. ''looove'' vs. ''issue''. Thus, rules about words that contain repeated letters should be applied, and the resulting candidates validated against a dictionary. However, elongated words can also represent out-of-dictionary words, e.g. superlatives, software names and acronyms, making validation harder. Moreover, disambiguation is needed to determine which word matches the context, in cases such as ''os'' (operating system) and ''oss'' (open source software).

Table 6: Examples of instances, from both testing data sets, on which the classifiers had problems correctly predicting the emotions.

EmoTxt performed significantly differently than in [11] for some emotions. For models trained and tested on JIRA, the maximum F-score difference is for joy: 57.3% vs. 86% in [11]. The reasons remain unclear. The corpus in [11] is smaller, but contains a similar number of neutral instances. We encountered 362 instances of joy, 834 of love and 457 of sadness, whereas 124, 166 and 302, respectively, are declared in [11]. It is also mentioned that the JIRA data set does not contain fear or surprise instances, contradicting both the description of JIRA [48] and our findings. The reported data size in [48], including neutral instances, is 5992, whereas we counted 5830. In [11], 4916 instances are declared, of which 4000 are neutral. It is not clear why the data set was truncated and how it was split for training and testing EmoTxt. This may have affected the results.
We also observed F-score differences for all emotions against EmoTxt trained and tested on Stack Overflow in [11]: for joy, we obtained 30.4% instead of 77%; 80.4% vs. 92% for love; 50.6% vs. 79% for sadness; 73.4% vs. 86% for anger; 0% vs. 58% for surprise; and 24% vs. 86% for fear. We noticed only small differences in the number of instances labelled with each emotion between our corpora and the one used in [11], e.g. 1200 vs. 1220 for love, or 106 vs. 104 for fear.
Another possible reason for the disagreement between EmoTxt's performance in this work and in [11] may lie in how the data was split for training and testing. As we focused on the multi-label aspect, we split both data sets considering all labels assigned to an instance, i.e. using stratified sampling. For example, we treat the 71 Stack Overflow instances labelled with joy-love (Table 1) as different from instances labelled with joy or love only. Thus, our training data contains 56 joy-love instances, 904 love-only instances and 323 joy-only ones. This consideration, along with the moderate inter-annotator agreement, may explain the F-score differences.
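The split described above can be sketched as follows, treating each distinct label combination as its own stratum; the toy corpus and split fraction are illustrative:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) such that each distinct label
    *combination* (e.g. {joy, love}) keeps roughly the same
    train/test proportion as in the full data set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, labs in enumerate(labels):
        strata[frozenset(labs)].append(i)
    train, test = [], []
    for idxs in strata.values():
        idxs = list(idxs)
        rng.shuffle(idxs)
        cut = round(len(idxs) * test_frac)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return train, test

# Toy corpus: joy-love instances form their own stratum, separate
# from joy-only and love-only instances.
labels = [["joy"]] * 10 + [["love"]] * 10 + [["joy", "love"]] * 5
train_idx, test_idx = stratified_split(labels)
```

With a 20% test fraction, exactly one of the five joy-love instances lands in the test split, mirroring how the 71 joy-love Stack Overflow instances were distributed proportionally.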
The lack of statistical difference between models that used or excluded lemmatisation may suggest that minor feature variations do not affect emotion classification. This may also hold for the number of n-grams and skip-bigrams. Similarly, the lack of statistical difference between using the SenticNet 5 or NRC lexica may indicate that, despite the variation in underlying theories and annotations, emotions are represented similarly. Moreover, the model parameters were tuned with Bayesian Optimisation, which maximises performance even when the features vary.
Two models with no statistically significant difference between them may still not perform exactly the same. Perhaps the test data was not large enough to reveal a difference; however, testing on larger data does not guarantee that a difference will become observable. The effect size may be very small, meaning that, in practice, the performance will be similar despite a statistical difference.
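For instance, an effect size such as Cohen's d can be small even when a difference in means exists; the per-fold F-scores below are hypothetical:

```python
import math

def cohens_d(a, b):
    """Cohen's d: standardised mean difference using the pooled
    (sample) standard deviation of the two score samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Hypothetical per-fold F-scores of two models: the mean gap is tiny
# relative to the spread, so the effect size stays below the usual
# 'small' threshold of 0.2.
a = [0.80, 0.82, 0.78, 0.81, 0.79]
b = [0.79, 0.81, 0.79, 0.80, 0.80]
d = cohens_d(a, b)
```

A |d| below 0.2 is conventionally read as a trivial-to-small effect, i.e. a difference that would rarely matter in practice even if a larger sample made it statistically detectable.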
We expected SenticNet 5 to boost performance, due to its large size and its OSS-related terms, e.g. ''memory leak'' and ''open source''. It seems that the number of lexicon entries is less important than their quality and the calibre of their annotations. We may need to represent lexicon information in more complex features.
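As one possible direction, instead of a single aggregate lexicon score, a richer per-emotion representation (sum, maximum and count of matched intensities) could be derived; the intensity values below are hypothetical stand-ins, not actual NRC Affect Intensity Lexicon entries:

```python
# Hypothetical intensity scores in the style of the NRC Affect Intensity
# Lexicon; real entries would come from the actual resource.
INTENSITY = {"furious": ("anger", 0.96), "annoyed": ("anger", 0.50),
             "delighted": ("joy", 0.92)}
EMOTIONS = ["anger", "fear", "joy", "sadness"]

def lexicon_features(tokens):
    """Build a (sum, max, count) triple per emotion from lexicon hits,
    flattened into one feature vector in a fixed emotion order."""
    feats = {e: [0.0, 0.0, 0] for e in EMOTIONS}
    for tok in tokens:
        if tok in INTENSITY:
            emo, score = INTENSITY[tok]
            feats[emo][0] += score                      # total intensity
            feats[emo][1] = max(feats[emo][1], score)   # strongest cue
            feats[emo][2] += 1                          # number of cues
    return [v for e in EMOTIONS for v in feats[e]]

vec = lexicon_features("i was furious and annoyed".split())
```

Such triples let a classifier distinguish one very strong cue from several weak ones, which a single summed score cannot.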
Investigating the correlation of labels, e.g. love and joy, can improve multi-label classification performance. This task is not straightforward for machine learning algorithms. In methods based on Binary Relevance, such as HOMER, label dependency is lost, since each label is considered separately. Methods based on Label Powerset, such as RAkEL, preserve the correlation between labels, as multi-label instances contribute new, composite labels. However, this can reduce the label density in training data sets, i.e. the number of instances decreases with respect to the number of total possible labels [24]. Furthermore, Label Powerset treats the correlation of labels as a conditional dependency, which may not always hold [14].
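The difference between the two transformation families can be sketched on a toy label set (the emotion sets are illustrative):

```python
from collections import Counter

# Toy multi-label training set
y = [{"joy"}, {"love"}, {"joy", "love"}, {"joy"}, {"anger"}]

# Binary Relevance: one independent binary problem per label,
# so the joy-love co-occurrence is invisible to each sub-problem.
all_labels = sorted(set().union(*y))
binary_problems = {lab: [int(lab in ys) for ys in y] for lab in all_labels}

# Label Powerset: each distinct label *combination* becomes one class,
# preserving correlations but fragmenting the training data: four
# classes now share only five instances.
powerset_classes = Counter(frozenset(ys) for ys in y)
```

Note how the joy-love instance strengthens the ''joy'' binary problem under Binary Relevance but forms a singleton class under Label Powerset, which is exactly the label-density reduction discussed above.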

Threats to validity
Trained models may not perform consistently on other data sets. Performance may be affected by domain variations, annotators and diverse annotation guidelines. This was observed in Section 8, where models trained on one data set do not perform well when tested on another, even within the same domain.
Parameter settings different from the optimised ones may lead to better performance. Bayesian Optimisation was chosen for its good balance between speed and quality; more exhaustive optimisation methods could have been used, at the cost of longer optimisation time. We enriched the classification features with what are, in our view, good indicators of emotions in text; different additional features may improve performance.

Conclusions and future work
Emotion classification is the task of determining which emotions are expressed in non-neutral text. The task is complex because text can express multiple emotions. The state-of-the-art lacks freely available annotated data and external resources, and offers only a few classification tools, especially in the domain of Open Source Software (OSS). We explored two multi-label classifiers, i.e. HOMER and RAkEL, and several lexica, i.e. SenticNet 5, the NRC Word-Emotion Association Lexicon and the NRC Affect Intensity Lexicon, to develop an emotion classifier for text related to OSS. The classifiers were evaluated on collections of JIRA Issue Tracker comments and Stack Overflow posts. We evaluated against EmoTxt, a state-of-the-art emotion classifier for OSS-related text, a random baseline and a most-frequent-class baseline. We used multi-label and single-label metrics, as well as statistical significance testing.
HOMER and RAkEL models outperformed both baselines. They performed statistically significantly better than EmoTxt when trained and tested on the JIRA data set; the effect size of the performance difference between EmoTxt and the proposed methods is small or small-to-medium. In general, our models achieved multi-label Micro F-scores, multi-label Macro F-scores and Subset 0/1 Loss of up to 81.1%, 59% and 29%, respectively. When trained and tested on Stack Overflow, our models performed similarly to EmoTxt and no statistical difference was found; further comparisons are necessary to establish a statistical difference, although they may show that the effect size of any difference is imperceptible or trivial. We conclude that using either HOMER or RAkEL does not affect the results significantly. Similarly, the size of the lexica did not affect performance, so the quality of lexicon annotations, rather than their size, may be key to improving performance.
In the future, we plan to explore other classification methods. For example, we would like to determine how an algorithm such as Stochastic Gradient Boosting Trees, which has been observed to perform consistently well across different data sets [67], behaves in multi-label tasks when used alone (with transformed labels) or as the main classifier of RAkEL and HOMER. We also plan to investigate algorithms such as Diversified Error Correcting Output Codes [63] to deal with the imbalance of specific emotions. In addition, we would like to explore how an embedded version of SMOTE, as in [66], could improve the performance of HOMER and RAkEL. We would also like to experiment with algorithms that have been specifically designed for multi-label tasks, such as FastText [27].
The inclusion of other external resources, based on word embeddings or other lexica, could also help improve performance. Relatedly, exploring new features for vector enrichment, as well as including methods for detecting and managing negations, may improve performance considerably. Finally, the use of neural networks together with transfer learning from another domain may also contribute to improving the performance of an emotion classifier for texts related to Open Source Software.