Confirming the Non-compositionality of Idioms for Sentiment Analysis

An idiom is defined as a non-compositional multiword expression, one whose meaning cannot be deduced from the definitions of the component words. This definition does not explicitly define the compositionality of an idiom’s sentiment; this paper aims to determine whether the sentiment of the component words of an idiom is related to the sentiment of that idiom. We use the Dictionary of Affect in Language augmented by WordNet to give each idiom in the Sentiment Lexicon of IDiomatic Expressions (SLIDE) a component-wise sentiment score and compare it to the phrase-level sentiment label crowdsourced by the creators of SLIDE. We find that there is no discernible relation between these two measures of idiom sentiment. This supports the hypothesis that idioms are not compositional for sentiment along with semantics and motivates further work in handling idioms for sentiment analysis.


Introduction
The processing of multiword expressions (MWEs) is an underrecognized subfield of natural language processing research. A multiword expression is defined as a phrase that can be decomposed into multiple lexemes and shows lexical, syntactic, semantic, pragmatic, or statistical idiosyncrasy (Baldwin and Kim, 2010), where a lexeme is a linguistic unit that constitutes the basic block of a language (Ramisch, 2015). MWEs are prevalent in modern text and increasing in frequency as modern language develops: Jackendoff (1997) estimates that the number of MWEs in a speaker's lexicon is roughly equivalent to the number of single words, and 44% of entries in WordNet 3.0 are multiword (Miller, 1995), 3 percentage points more than in WordNet 1.7 (Sag et al., 2002). Ignoring MWEs when analyzing natural speech can result in models that cannot handle variation or fail to generalize, and relying on complicated preprocessing or ad hoc methods of handling MWEs creates systems that are difficult to maintain or extend (Sag et al., 2002).
Idioms, a subset of MWEs, are particularly challenging to analyze because they are non-compositional: the meaning of the entire idiom cannot be deduced from the definitions of each individual word in it (Jochim et al., 2018). Treating idioms like "it's raining cats and dogs" with a words-with-spaces approach can diminish the accuracy of a model that treats each word as the smallest unit of a sentence; the example idiom simply means that it is raining heavily and is unrelated to animals. Beyond meaning, past work has shown that ignoring idioms in sentiment analysis tasks lowers the accuracy of a sentiment classifier (Williams et al., 2015), but the non-compositionality of idiom sentiment is not part of the accepted definition of an idiom and should not be assumed without further research.
The goal of this paper is to confirm or deny the non-compositionality of idiom sentiment. Some idioms, like "a blessing in disguise," "so far so good," "in the red," and "add insult to injury," show potential compositionality of sentiment based on the positive sentiments of "blessing" and "good" and negative sentiments of "red," "insult," and "injury." Other examples, like "break a leg," "speak of the devil," and "let the cat out of the bag," would imply the wrong sentiment based on the negative sentiment in "break" and "devil" and lack of strong polar sentiment in any of the words "let," "the," "cat," "out," "of," and "bag." Based on the definition of an idiom, that the collective meaning of component words does not predict the meaning of the entire phrase, we hypothesize that the sentiment of an idiom is non-compositional. We test this hypothesis by comparing two scores for each idiom in the Sentiment Lexicon of IDiomatic Expressions (SLIDE): a DAL sentiment score based on each word in the idiom and a SLIDE positive percent index given by the lexicon.

Related Work
Williams et al. (2015) explore how much the inclusion of idioms as features improves traditional sentiment classification and provide a set of 580 idioms annotated with sentiment polarity and a corpus of sentences containing idioms in context. Each sentence was labeled with an emotion, and the authors compared models that predicted the gold standard with and without separate treatment of idioms. When comparing the results, they noted significant improvement in F-score for all three sentiment classes: positive, negative, and other. The results of Williams et al.'s work demonstrate the need to include additional methods for handling idioms in sentiment analysis.
Ramisch and Villavicencio (2018) define the linguistic characteristics of MWEs and discuss how to incorporate MWEs into language technology. Savary et al. (2017) produce a multilingual 5-million-word annotated corpus of verbal MWEs (such as "to break one's heart") and annotation guidelines for eighteen languages. Seretan (2008) provides a syntax-based methodological framework for automatically identifying idiomatic collocations in text corpora. Many neural models of sentiment, like the one used by Socher et al. (2013), assume that sentiment is compositional. Zhu et al. (2015) incorporate both compositional and non-compositional sentiment by using an automatic labeling method for the non-compositionality of n-grams, while we focus on annotated idioms.
Jochim et al. (2018) present SLIDE, the Sentiment Lexicon of IDiomatic Expressions. SLIDE is a collection of 5,000 idiomatic expressions, a great expansion from Williams et al.'s set of 580 idioms. Jochim et al. used CrowdFlower to have at least ten annotators label each idiom as positive, negative, neutral, or inappropriate. The lexicon includes the distribution of annotations and a sentiment label that represents the label that received the majority of votes. In the case of a tie between positive/negative and neutral, the idiom is labeled positive/negative; in the case of a tie between positive and negative, the idiom is labeled neutral. The SLIDE polarity annotations were critical for the endeavors of this paper.
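The majority-vote labeling rule described for SLIDE can be illustrated with a minimal sketch. The function name `slide_label` is hypothetical and is not taken from the SLIDE release; the sketch also assumes that "inappropriate" votes are simply ignored and that a three-way tie falls back to neutral, neither of which is specified in the description above.

```python
from collections import Counter


def slide_label(votes):
    """Return a majority label for a list of annotator votes.

    Tie-breaking follows the rule described in the text: a polar label
    beats neutral, and a positive/negative tie is labeled neutral.
    Assumption: votes other than positive/negative/neutral are ignored.
    """
    counts = Counter(v for v in votes if v in ("positive", "negative", "neutral"))
    pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]
    top = max(pos, neg, neu)
    if pos == top and neg == top:
        # Positive and negative tie for the lead (assumed to cover 3-way ties).
        return "neutral"
    if pos == top:
        # Positive wins outright or ties with neutral.
        return "positive"
    if neg == top:
        # Negative wins outright or ties with neutral.
        return "negative"
    return "neutral"
```

For example, five positive and five neutral votes yield "positive" under this rule, while four positive, four negative, and two neutral votes yield "neutral".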
To compute sentiment scores for idioms based on each component word, we relied on the technique developed by Agarwal et al. (2009) to detect phrase-level polarity. They derived lexical scores for pleasantness, activation, and imagery from the Dictionary of Affect in Language (Whissell, 1989) augmented by WordNet (Miller, 1995), used a finite state machine to handle local negations, and boosted scores to capture the strength of words that may otherwise have received similar pleasantness scores; consider the difference between "fairly good advice" and "excellent advice," for example. We implemented their method of computing sentiment scores to compare against the phrase labels provided by SLIDE.

SLIDE Positive Percent Index and Sentiment Label
We used the Sentiment Lexicon of IDiomatic Expressions (SLIDE) (Jochim et al., 2018) to give each idiom a positive percent index and sentiment label. The sentiment labels were given by the lexicon as a majority vote of at least ten crowdsourced annotations per idiom, and only idioms that are labeled positive (946), negative (1,108), or neutral (2,945) were used in this study, for a total of 4,999 idioms. The full dataset was used for analysis. The positive percent index was calculated by subtracting the combined percentage of negative and neutral votes from the percentage of positive votes. This system of quantitatively evaluating sentiment emphasizes the positive score of an idiom without distinguishing neutral and negative sentiment. In this study, we focus on positive sentiment; alternatives include calculating negative or neutral percent indices or subtracting just the negative percentage of votes to capture the nuances of sentiment strength.
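As a rough illustration of the positive percent index, the sketch below assumes that the index subtracts the combined share of negative and neutral votes from the share of positive votes. That reading matches the stated range, where −1.0 means no annotator chose positive and 1.0 means every annotator did. The function name and the use of fractions rather than percentages are choices made here for illustration.

```python
def positive_percent_index(pos, neg, neu):
    """Positive share of votes minus the combined negative and neutral share.

    Arguments are raw vote counts. The result lies in [-1.0, 1.0].
    Assumption: neutral votes count against the index, like negative votes.
    """
    total = pos + neg + neu
    if total == 0:
        raise ValueError("at least one vote is required")
    return (pos - (neg + neu)) / total
```

Under this definition an idiom with five positive, two negative, and three neutral votes scores 0.0, and an idiom with no positive votes scores −1.0 regardless of how the remaining votes split, which is why the index cannot separate negative from neutral idioms.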

Component-wise Idiom Scoring
We compute component-wise scores by implementing Agarwal et al.'s method of measuring phrase-level polarity (Agarwal et al., 2009). These scores represent the compositional sentiment of an idiom. We begin by tokenizing the idiom (Honnibal and Montani, 2017) and assigning each word a pleasantness score from the Dictionary of Affect in Language (DAL) (Whissell, 1989); if the word is not present in the DAL, we use the pleasantness score for a synonym or the negated pleasantness score for an antonym from WordNet (Miller, 1995). We consider the word senses from WordNet in order, which is based on frequency of use, and use the first sense that has a DAL entry. The scores are Z-normalized according to the mean and standard deviation of each sentiment class given in the manual for the DAL and boosted by multiplying by the number of standard deviations they lie from the mean.
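The word-level scoring step can be sketched as follows. Everything named here is hypothetical: the toy lexicon, the synonym and antonym maps (stand-ins for WordNet lookups), and the mean and standard deviation values are placeholders, not figures from the DAL or its manual. The sketch reads "boosting" as multiplying the z-score by its own absolute value, negates the normalized score for antonyms, and gives unknown words a score of 0.0; these are assumptions about details the description leaves open.

```python
# Placeholder data for illustration only; real values come from the DAL.
TOY_DAL = {"good": 2.8, "blessing": 2.7, "injury": 1.2}
TOY_SYNONYMS = {"great": ["good"]}   # stand-in for WordNet synonyms
TOY_ANTONYMS = {"bad": ["good"]}     # stand-in for WordNet antonyms
DAL_MEAN, DAL_STD = 1.85, 0.36       # hypothetical normalization constants


def normalize_and_boost(raw):
    """Z-normalize a raw pleasantness score, then boost it.

    Boosting multiplies the z-score by the number of standard deviations
    it lies from the mean (its absolute value), preserving the sign.
    """
    z = (raw - DAL_MEAN) / DAL_STD
    return z * abs(z)


def word_score(word):
    """Score one word, falling back to synonyms and then antonyms."""
    word = word.lower()
    if word in TOY_DAL:
        return normalize_and_boost(TOY_DAL[word])
    for syn in TOY_SYNONYMS.get(word, []):
        if syn in TOY_DAL:
            return normalize_and_boost(TOY_DAL[syn])
    for ant in TOY_ANTONYMS.get(word, []):
        if ant in TOY_DAL:
            # Assumption: the antonym's normalized score is negated.
            return -normalize_and_boost(TOY_DAL[ant])
    return 0.0  # Assumption: words with no usable entry contribute nothing.
```

With these placeholder values, `word_score("great")` equals `word_score("good")` through the synonym fallback, and `word_score("bad")` is its negation through the antonym fallback.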
We then handle local negations with a finite state machine of two states: RETAIN and INVERT. The scores remain the same when the finite state machine is in the RETAIN state and are negated when in the INVERT state. Each idiom starts in the RETAIN state and switches to the INVERT state when a negation, like "not," "no," and "never," is encountered. The finite state machine returns to the RETAIN state if it encounters the word "but" or a comparative degree adjective, like "better" or "worse," to account for phrases like "no better than evil." The idiom's component-wise score is the sum of the scores for each component word normalized by the length of the idiom.
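The two-state negation machine and the length-normalized sum can be sketched as below. The function name, the small word sets, and the dictionary of precomputed word scores are hypothetical. The sketch assumes the state is updated before the triggering token is scored, and it uses a fixed set in place of detecting comparative adjectives with a part-of-speech tagger.

```python
NEGATIONS = {"not", "no", "never"}
# "but" plus a few comparative adjectives; a fuller version would
# detect comparatives from part-of-speech tags instead of a fixed set.
RESETS = {"but", "better", "worse"}


def component_score(tokens, word_scores):
    """Sum per-word scores with local negation, normalized by length.

    tokens: list of word strings for one idiom.
    word_scores: mapping from lowercase word to its (boosted) score;
    missing words are treated as 0.0.
    """
    state = "RETAIN"
    total = 0.0
    for tok in tokens:
        tok = tok.lower()
        if tok in NEGATIONS:
            state = "INVERT"
        elif tok in RESETS:
            state = "RETAIN"
        score = word_scores.get(tok, 0.0)
        total += -score if state == "INVERT" else score
    return total / len(tokens) if tokens else 0.0
```

Given illustrative scores `{"good": 2.0, "better": 1.0, "evil": -3.0}`, the tokens of "not good" score (0 − 2) / 2 = −1.0, while "no better than evil" returns to RETAIN at "better" and scores (0 + 1 + 0 − 3) / 4 = −0.5.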

Results and Discussion
We computed the Spearman correlation between the predicted component-wise scores and the gold SLIDE annotations, along with a p-value, for each sentiment class, under the null hypothesis that the two sets of data are uncorrelated. The Spearman correlation for each sentiment class is close to 0, which suggests no correlation, and we fail to reject the null hypothesis for idioms labeled neutral and negative. Even though p ≤ 0.05 for idioms labeled positive, the Spearman correlation of −0.144 is weak and negative, so it still gives no indication that the predicted scores track the gold labels. These values further support our claim that idioms are non-compositional for sentiment. When plotted against the crowdsourced sentiment distribution from SLIDE, the component-wise sentiment scores show no obvious pattern (see Figure 1). In total, 19% of idioms were labeled positive, 22% labeled negative, and 59% labeled neutral.
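For reference, a Spearman correlation can be computed as the Pearson correlation of rank-transformed data. The sketch below is a dependency-free illustration of that definition, not the code used for the reported values; it omits the p-value, which in practice would come from a statistics package (for example, `scipy.stats.spearmanr` returns both the coefficient and a p-value).

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applied per sentiment class, `x` would hold the component-wise scores and `y` the corresponding SLIDE values for the idioms in that class.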

The SLIDE positive percent indices range from −1.0, which means that no annotators labeled the idiom positive, to 1.0, which means that all annotators labeled it positive. Figure 1 shows clear separation between idioms labeled positive (•) and idioms labeled negative (□) but, as expected, does not distinguish between negative and neutral (×). It does, however, show the lack of obvious correlation between the crowdsourced positive percent index (horizontal axis) and the computed DAL positive index (vertical axis). If idiom sentiment were compositional, we would expect the SLIDE positive percent index and the DAL positive index to be directly related, but we can see from Figure 1 that idioms with the highest SLIDE positive percent index do not strictly correspond to a higher DAL positive index. In fact, there seems to be no relationship between these two measurements of phrase sentiment at all.
Furthermore, even though the SLIDE positive percent index poorly distinguishes between idioms with majority negative and neutral votes, we would expect to see consistently lower DAL positive indices for idioms labeled negative than for idioms labeled neutral. Negatively labeled idioms do have a noticeably lower mean DAL positive index but a much larger standard deviation than neutral idioms. Surprisingly, positively labeled idioms have an even lower mean DAL positive index than negatively labeled idioms, with a comparable standard deviation. It is interesting that negatively and positively labeled idioms (idioms that express some emotion) both display much lower mean values and much greater standard deviations of DAL positive index scores, while neutral (unemotional) idioms tend to vary less. This may indicate that emotional idioms contain emotional words, but the sentiment of the words does not necessarily correlate with the sentiment of the entire phrase.

Conclusion and Future Work
Our analysis shows that there is no consistent correlation between component-wise sentiment scores and crowdsourced phrase-level labels, which supports the hypothesis that idioms are non-compositional for sentiment as well as meaning. The non-compositionality of sentiment was not explicitly defined or immediately obvious for idioms, and the lack of relationship between component words and phrase-level sentiment motivates further research in handling idioms in context. Multiword expressions in general are very common and increasing in frequency in modern language, and we have demonstrated that treating MWEs as words-with-spaces rather than separate, complete entities can lead to inconsistent results in sentiment labeling.
Possible future work in the sentiment analysis of MWEs includes learning domain-specific sentiment without manual annotation, like predicting a negative sentiment for the phrase "high blood pressure" in the context of a poor health condition. Work must also be done in recognizing new MWEs as language evolves, as well as associating new meanings with already existing words and phrases. This is particularly important for processing Internet slang, which evolves and generates new vocabulary very quickly through social media. For example, the saying "yeet haw," a combination of the words "yeet" and "yeehaw," which are both casual expressions of excitement, has risen in occurrence. Manually annotating common idioms, as the creators of SLIDE had CrowdFlower workers do, is a tedious, time-consuming, and never-ending task as long as language keeps changing. Learning to recognize MWEs and associate proper sentiment scores with them is an important step in improving overall sentiment classification.

Figure 1: Component-wise sentiment score vs. SLIDE positive percent index with sentiment labels

Table 1: Spearman correlation scores and p-values