Adjectival polarity and the processing of scalar inferences

In a seminal study, Bott & Noveck (2004) found that the computation of the scalar inference of ‘some’ implying ‘not all’ was associated with increased sentence verification times, suggesting a processing cost. Recently, van Tiel and colleagues (2019b) hypothesised that the presence of this processing cost critically depends on the polarity of the scalar word. We comprehensively evaluated this polarity hypothesis on the basis of a sentence-picture verification task in which we tested the processing of 16 types of adjectival scalar inferences. We developed a quantitative measure of adjectival polarity which combines insights from linguistics and psychology. In line with the polarity hypothesis, our measure of polarity reliably predicted the presence or absence of a processing cost (i.e., an increase in sentence verification times). We conclude that the alleged processing cost for scalar inferencing in verification tasks is not due to the process of drawing a scalar inference, but rather to the cognitive difficulty of verifying negative information.

Several more recent proposals side with relevance theory in assuming that the presence of a processing cost for scalar inferencing varies with certain methodological and contextual factors. However, they do not necessarily commit to the relevance-theoretic assumption that relevance is paramount in deciding whether or not a processing cost will be observed. For example, these proposals have argued that the presence of a processing cost depends on the question under discussion (Westera 2017; Ronai & Xiang 2020), the structural characteristics of the alternatives (Chemla & Bott 2014; van Tiel & Schaeken 2016), the naturalness of the utterance (Degen & Tanenhaus 2016), and, as we will discuss in much more detail later, the polarity of the scalar inference (van Tiel et al. 2019b).
Testing these different theories about the processing of scalar inferences requires operationalising the notion of a processing cost. Various proposals have been made in this respect, focusing on participants' eye movements (e.g., Grodner et al. 2010; Huang & Snedeker 2018), brain signals (e.g., Noveck & Posada 2003; Barbet & Thierry 2018), reading times (e.g., Breheny et al. 2006; Politzer-Ahles & Husband 2018), and working memory capacity (e.g., De Neys & Schaeken 2007; Marty & Chemla 2013). In this study, we focus on the idea that processing costs can be measured by looking at sentence verification times. In the next section, we briefly discuss previous studies using this measure. We show that these studies have given rise to conflicting results, and describe a recent proposal that aims to make sense of these conflicting data in terms of the polarity of scalar words. Afterwards, we turn to our own study in which we systematically and extensively tested the polarity-based explanation.

Previous sentence verification studies
In sentence verification studies, participants are presented with a sentence and have to decide whether that sentence is true or false in a given situation. This situation can be presented pictorially or correspond to participants' world knowledge. To carry out the verification process, it is often assumed that participants represent both the sentence and the situation in a common format, e.g., a proposition. In addition, participants initialise a truth index that tracks the truth value of the sentence. Sentence verification then consists in systematically manipulating and comparing the representations associated with the sentence and the situation, and carrying out operations on the truth index (cf. Clark & Chase 1972; Carpenter & Just 1975).
To examine whether the computation of scalar inferences is associated with a processing cost, Bott & Noveck (2004) tested the ⟨some, all⟩ scale in a series of sentence verification tasks. Participants in their experiments had to indicate the truth value of underinformative sentences like (2).
(2) a. Some dogs are mammals.
b. Some parrots are birds.
These sentences are true when interpreted literally, since, e.g., there are dogs that are mammals, but they are false when the scalar inference is computed and 'some' is interpreted as 'some but not all', since, in fact, all dogs are mammals. Hence, participants' truth judgements to these underinformative sentences are indicative of whether or not they computed a scalar inference.
In Bott and Noveck's Exp. 3, participants gave intuitive truth judgements to sentences such as (2). Many participants were ambivalent about the truth of underinformative sentences like these, varying their responses across structurally similar trials. Comparing the verification times of these ambivalent participants, Bott and Noveck found that it took participants significantly longer to answer 'false' (i.e., the answer suggesting a two-sided interpretation) than 'true' (i.e., the answer suggesting a literal interpretation). This difference in verification times was absent in a control condition with sentences that were unambiguously true or false, as in (3).
(3) Some dogs are birds.
van Tiel and Pankratz
Glossa: a journal of general linguistics DOI: 10.5334/gjgl.1457

The pattern of results that Bott and Noveck observed suggests that the computation of scalar inferences is associated with a processing cost, at least in out-of-the-blue contexts. This conclusion is in line with relevance theory and several more recent approaches (e.g., Chemla & Bott 2014; van Tiel & Schaeken 2016; Degen & Tanenhaus 2019), but speaks against Levinson's proposal that the default interpretation of 'some' is two-sided.
In what follows, we refer to Bott and Noveck's finding that participants take significantly longer to reject underinformative sentences like (2) than to accept them as the B&N effect. The B&N effect for the ⟨some, all⟩ scale has been replicated in numerous studies (e.g., Noveck & Posada 2003; Tomlinson Jr. et al. 2013; Chemla & Bott 2014; Cremers & Chemla 2014; van Tiel & Schaeken 2016; Ronai & Xiang 2020). At the same time, however, several studies have shown that the B&N effect does not always generalise beyond the specific case of 'some'.
For example, Chevallier and colleagues (2010) tested the ⟨or, and⟩ scale in a sentence-picture verification task. Participants in their experiment had to judge the truth value of sentences like (4) in displays showing different types of objects.
(4) There is a sun or a train.
In the target condition, the display for (4) showed both a sun and a train. Here, the sentence is literally true but false if the scalar inference is computed and 'or' is interpreted as excluding 'and'. As in Bott and Noveck's study, many participants vacillated between responding with 'true' or 'false' in the target condition. Unlike Bott and Noveck's study, however, Chevallier and colleagues did not observe a significant difference in verification times between 'true' and 'false' answers.
Even more challenging data come from studies testing the processing of scalar words in negative sentences, as in (5).

(5) a. Not all dogs are insects.
b. Not all parrots are mammals.
On their literal interpretation, the sentences in (5) merely convey that there are dogs that are not insects, and that there are parrots that are not mammals. So on their literal interpretation, these sentences are true. However, the sentences in (5) may give rise to the indirect scalar inference that the corresponding sentences with 'no' are false, i.e., that at least some dogs are insects, and that at least some parrots are mammals. Clearly, these scalar inferences are false.
Cremers & Chemla (2014: Exp. 1) asked participants to give their intuitive truth judgements to sentences such as (5). As in Bott and Noveck's study, participants were ambivalent about the truth of these underinformative sentences, responding differently across structurally similar trials. However, unlike Bott and Noveck's study, Cremers and Chemla found that participants were faster when responding 'false' than when responding 'true' (recall that the B&N effect consists in slower verification times when responding 'false'). This difference in response times was absent in a control condition involving sentences that were unambiguously true or false, as in (6).
(6) a. Not all mammals are dogs.
b. Not all dogs are mammals.

Romoli & Schwarz (2015) and Marty et al. (2020) found the same pattern of response times for various other types of indirect scalar inferences. These findings are noteworthy because they suggest that scalar inferences are processed differently when negation is involved. We will return to this point below.
To obtain a more comprehensive picture of the generalisability of the B&N effect, van Tiel and colleagues (2019b) tested the processing of seven lexical scales: ⟨some, all⟩, ⟨or, and⟩, ⟨most, all⟩, ⟨might, must⟩, ⟨try, succeed⟩, ⟨low, empty⟩, and ⟨scarce, absent⟩. In their Exp. 1, participants gave intuitive truth judgements to sentences containing the weaker scalar word. These sentences were presented in two types of displays. In control displays, the sentence was unambiguously true or false. In target displays, the sentence was literally true but false if the corresponding scalar inference was computed. To illustrate, Table 1 shows the materials for ⟨some, all⟩ and ⟨low, empty⟩.

In line with Bott and Noveck's study, van Tiel and colleagues found that, in the case of 'some', participants were significantly slower to answer 'false' than 'true' in the target condition, whereas no difference in verification times was observed in the control condition. Van Tiel and colleagues also observed a B&N effect for 'or' (in contrast with the aforementioned study by Chevallier et al. 2010), 'might', 'most', and 'try'. In the case of 'low' and 'scarce', however, no significant difference in verification times between 'true' and 'false' responses was observed.
To explain this pattern of results, van Tiel and colleagues rely on the notion of polarity. In particular, van Tiel and colleagues argue that only the scalar inferences associated with positive scalar words are associated with a B&N effect, and that this effect is the result of the cognitive difficulty of processing the corresponding negative scalar inference. In the next section, we first introduce the notion of polarity. Afterwards, we discuss in more detail the polarity-based explanation proposed by van Tiel and colleagues.

Polarity
Polarity is a fundamental but multifarious construct that refers to the fact that some words in natural language are positive while others are negative (cf. Horn 1989: Ch. 1-3 for an excellent overview). For example, 'warm' is usually assumed to be positive, whereas 'cold' is assumed to be negative. As this example already shows, negative words are not always explicitly marked for negativity. When negative marking is absent, these words are assumed to have an implicit negative element in their underlying semantic representation (Clark 1974; Heim 2008; Moracchini 2019). The notion of polarity has been prominently studied in linguistics and psychology, though mostly disparately (but cf. Ingram et al. 2016 and Nouwen 2020 for more integrative approaches). However, these two fields have operationalised polarity in importantly different ways.
In linguistics, polarity is usually operationalised in terms of markedness, i.e., negative words tend to be marked compared to their positive counterparts (e.g., Greenberg 1966; Lyons 1968; Clark & Clark 1977; Givón 1979; Lehrer & Lehrer 1982; Lehrer 1985; Sassoon 2010; Morzycki 2015). There are various ways of determining whether or not a word is marked. One such way relies on the fact that certain words make reference to a measurement scale in their semantics (e.g., Kennedy & McNally 2005; Solt 2015). To illustrate, compare 'many' and 'few'. Both of these words operate on the quantity scale. However, whereas 'many' denotes a lower bound on the quantity scale (e.g., 'Many flowers are red' implies that the number of red flowers is greater than a contextually determined threshold), 'few' denotes an upper bound (van Tiel et al. 2021). To put it differently, "many-ness" and quantity are positively related, whereas "few-ness" and quantity are negatively related. As a consequence, 'many' is usually characterised as positive and 'few' as negative.
Van Tiel and colleagues rely on this characterisation, which they call the scalarity criterion, to intuitively classify the scalar words in their sample as positive or negative. Based on this criterion, they labelled 'low' and 'scarce' as negative, and all other words as positive. Recall that 'low' and 'scarce' were also the only two scalar words that failed to give rise to the B&N effect. This concurrence between polarity and processing led van Tiel and colleagues to hypothesise that polarity is the key feature in determining whether or not a B&N effect will be observed.
However, in addition to the scalarity criterion, there are various other diagnostics of linguistic polarity. Since adjectives are of particular interest for the current study, we focus here on two ways of diagnosing the linguistic polarity of adjectives. Both of these diagnostics build on a standard assumption in linguistics that many adjectives are members of antonym pairs where one member is positive and the other is negative (e.g., Lehrer & Lehrer 1982). A first way of diagnosing adjectival polarity involves the interpretation of 'how' questions.
In particular, 'how' questions involving positive adjectives tend to be neutral, whereas those involving negative adjectives tend to presuppose that the adjective holds (e.g., Rett 2008). To illustrate, consider the questions in (7).
(7) a. How long is a day on Venus?
b. How short is a day on Venus?
Whereas (7a) is neutral about whether days on Venus are long or short, (7b) intuitively suggests that the speaker believes they are short. This observation suggests that 'long' is positive, while 'short' is negative. A direct consequence of the fact that negative adjectives are biasing in 'how' questions is that they are less likely to occur in such questions than positive adjectives; e.g., the phrase 'how long' is much more frequent in the ENCOW16A corpus (a web corpus consisting of almost 17 billion tokens, cf. Schäfer & Bildhauer 2012; Schäfer 2015) than the phrase 'how short' (199,033 vs. 2,456 occurrences).
A second way of linguistically delineating positive and negative adjectives looks at ratio phrases, such as 'twice as' and 'half as' (cf. Sassoon 2010). Ratio phrases presuppose a natural zero point. For many positive adjectives, such as 'tall' and 'old', such a natural zero point is intuitively available. For example, conceptually, there is such a thing as zero tallness (i.e., being 0 centimeters tall) or zero oldness (i.e., being 0 days old). For many negative adjectives, however, there is no natural zero point. Thus, there is no such thing as zero shortness (which would correspond to infinite length) or zero youngness (infinite age). As a consequence, the positive adjective 'old' is felicitous in ratio phrases, while its negative counterpart 'young' is slightly odd, as shown by the minimal pair in (8).
(8) a. She is twice as old as him.
b. ?He is twice as young as her.
In line with this observation, Sassoon (2010) provides corpus data showing that positive adjectives are, as a rule, significantly more frequent than negative adjectives in ratio constructions such as 'twice as'. Accordingly, 'twice as old' was substantially more frequent than 'twice as young' in the ENCOW16A corpus (258 vs. 3 occurrences).
In psychology, polarity is usually defined in terms of subjective valence, i.e., in terms of the positive or negative connotations that people have with a particular word (e.g., Wason 1959; Osgood & Richards 1973). To illustrate, Mohammad (2018) presented participants with short lists of words and asked them to rank these based on their valence. These rankings were then converted to numeric values between 0 (indicating that the word was always ranked at the bottom of the list) and 1 (always ranked at the top). Thus, e.g., 'good' was associated with a value of 0.938 and 'bad' with a value of 0.125, reflecting the intuition that 'good' is positive and 'bad' is negative.
The psychological notion of polarity as subjective valence also reverberates in natural language in several ways (e.g., Boucher & Osgood 1969; Benjafield & Adams-Webber 1976; Paradis et al. 2012). For example, it has been found that psychologically positive words are more frequently attested than negative ones (i.e., the Pollyanna hypothesis formulated and tested by Boucher & Osgood 1969). In line with this idea, a search in the ENCOW16A corpus shows that 'good' is more than four times as frequent as 'bad' (10,869,258 vs. 2,289,838 occurrences).
In most cases, the linguistic and psychological notions of polarity go hand in hand, but not always. For example, as we just saw, from a linguistic perspective, 'old' is positive while 'young' is negative. From a psychological perspective, the converse holds: 'young' is positive and 'old' is negative (e.g., in the study by Mohammad 2018, participants gave 'old' a valence rating of 0.41 and 'young' a rating of 0.81). Therefore, one of the main contributions of this paper is the synthesis of several (potentially conflicting) polarity diagnostics into a single continuous polarity measure. We use this polarity measure to systematically test van Tiel and colleagues' hypothesis that only positive scalar words are associated with a B&N effect (recall that they relied on the scalarity criterion, a rather intuitive measure, to classify scalar words as positive or negative). However, before going into more detail about the present study, we explore why it is plausible that polarity affects the processing of scalar inferences.

A polarity-based explanation
In order to explain why only positive scalar words are associated with a B&N effect, van Tiel and colleagues rely on the observation that verification times are systematically affected by the polarity of the sentence. To illustrate, consider the three sentences in (9). These sentences vary in their polarity: (9a) is positive, (9b) contains the implicitly negative word 'below', and (9c) contains the explicit sentential negation 'not'. In what follows, we will conveniently refer to these three types of sentences as positives, implicit negatives, and explicit negatives.
(9) a. The star is above the cross.
b. The cross is below the star.
c. The cross is not above the star.
Clark & Chase (1972) measured the verification times for these three types of sentences in displays that always showed two vertically juxtaposed images. Crucially, the three sentences in (9) are all equivalent in these displays. Nonetheless, Clark and Chase found that participants were significantly faster to verify positives like (9a) than implicit negatives like (9b), which, in turn, were verified significantly more rapidly than explicit negatives like (9c). In other words, Clark and Chase found evidence for the generalisation in (10), where '<' denotes faster verification times.
(10) positive < implicit negative < explicit negative

Clark and Chase's findings have been replicated in numerous studies (e.g., Wason 1972; Carpenter & Just 1975; Fodor et al. 1975; Cheng & Huang 1980; Proctor & Cho 2006). However, it should be noted that all of these studies tested sentences that were presented without any relevant context. Indeed, later work has convincingly shown that these findings do not always generalise to more contextualised settings. For example, several studies found that explicit negatives can be verified as fast as positives if the context is right (cf. Wason 1965; Nieuwland & Kuperberg 2008; Tian et al. 2010). In what follows, we will largely ignore this important qualification, since almost all verification studies on the processing of scalar inferences have made use of out-of-the-blue contexts (but cf. Ronai & Xiang 2020 for a recent exception).
To see how the generalisation in (10) may explain van Tiel and colleagues' observation that only positive scalar words are associated with a B&N effect, consider their target sentence for the positive scalar word 'some' and its scalar inference in (11).
(11) Some of the socks are pink. ⟿ Not all of the socks are pink.
Participants who interpreted the target sentence literally only had to verify a positive sentence, whereas participants who arrived at a two-sided interpretation also had to verify the explicitly negative scalar inference. Given the generalisation in (10), we may expect that verifying the explicitly negative scalar inference leads to elevated response times compared to the literal interpretation. So, even if we assume that participants who arrived at a two-sided interpretation of the target sentence verified the literal interpretation and the scalar inference in parallel, it follows that they are expected to be significantly slowed down compared to participants who interpreted the sentence literally. Now consider the target sentence for the implicitly negative scalar word 'scarce' along with its scalar inference in (12).
(12) Red flowers are scarce. ⟿ Red flowers are not absent.
Participants who arrived at a literal interpretation of the target sentence had to verify an implicitly negative sentence, since 'scarce' is implicitly negative. What about the scalar inference? Superficially, the scalar inference appears to involve a double negation in 'not absent'. Hence, intuitively, one might suppose that the verification of the scalar inference should take longer than verifying the literal interpretation. However, van Tiel and colleagues contend that the scalar inference in this case is verified at least as fast as the literal interpretation. There are various arguments that may support this proposal. First, it has been found that, in at least some cases, sentences containing two negative elements are processed more rapidly than the corresponding sentences with a single negative element. For example, Sherman (1976) found that participants were faster to verify sentences containing the double negation 'no one doubted' ('doubt' is implicitly negative) than sentences containing just the single negative word 'doubted' in an otherwise positive sentence. Hence, it could be that 'not absent' is easier to verify than 'scarce'. Second, it could be the case that the scalar inference is cognitively represented as a positive, i.e., as 'Red flowers are present'. The double negation could be eliminated on the fly, or, perhaps more plausibly from a psychological standpoint, it could be that the positive form of the scalar inference is directly associated with the scalar word, similarly to Levinson's (2000) defaultist proposal, so that participants who derive the scalar inference interpret the target sentence in (12) as 'Red flowers are scarce but present'.
In any case, if we assume that the scalar inference in (12) can be verified at least as rapidly as the literal interpretation, and if we furthermore assume that participants who compute the scalar inference verify both the literal interpretation and the scalar inference in parallel, it follows that participants who computed the scalar inference should be just as fast as participants who arrived at a literal interpretation. One might even expect participants who derived the scalar inference to be faster than participants who arrived at the literal interpretation, since the positivity of the scalar inference seems to entail that it should be verified faster than the implicitly negative literal interpretation, and since the sentence may be judged false as soon as the scalar inference is verified. However, psycholinguistic studies have consistently shown that 'false' responses to positive sentences are generally slightly delayed compared to 'true' responses (e.g., Clark & Chase 1972), which might mitigate the verification time advantage for positives relative to implicit negatives, and thus lead to roughly equal verification times for literal and two-sided interpretations.
This polarity-based explanation also harmonises with the previously discussed findings on indirect scalar inferences. To illustrate, (13) shows a target sentence from Cremers and Chemla's (2014) study, as well as its scalar inference.
(13) Not all dogs are insects. ⟿ It's not the case that not some dogs are insects.
Again, the proposal is that the scalar inference is verified more rapidly than the literal interpretation, either because the double negation is eliminated or because 'not all' is statistically associated with 'some' (rather than the equivalent 'not not some'). The reason that (13) gave rise to the reverse B&N effect, rather than the absence of any effect, as van Tiel and colleagues found for cases like 'scarce', is that the target sentence contains an explicit negation. As noted before, explicit negatives take longer to verify than implicit negatives. Hence, participants who accepted the target sentence had to verify the more time-consuming literal interpretation, whereas participants who arrived at the two-sided interpretation verified both the literal interpretation and the (positive) scalar inference in parallel. In the latter case, participants could respond with 'false' as soon as they realised that the scalar inference was false, which took less time than verifying the literal interpretation and responding with 'true'.

However, the current support for the polarity hypothesis is comparatively thin, consisting solely of the data for 'low' and 'scarce' (as well as perhaps earlier data on indirect scalar inferences). Moreover, in addition to being the only negative scalar words tested by van Tiel and colleagues, 'low' and 'scarce' were also the only adjectival scalar words they tested. This may have influenced the results in various ways, e.g., one can imagine that adjectival scales are less salient given the openness of the grammatical class (cf. van Tiel et al. 2016).
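The timing logic of the polarity-based explanation can be made concrete with a toy calculation. This is our own illustration, not a model from the literature: the constants are arbitrary and merely respect the ordering in (10), and FALSE_DELAY stands in for the small assumed penalty on 'false' responses.

```python
# Toy illustration of the polarity-based explanation; the constants are
# arbitrary units chosen only to respect (10): positive < implicit negative
# < explicit negative.
POS, IMPL_NEG, EXPL_NEG = 1.0, 1.5, 2.0
FALSE_DELAY = 0.2  # assumed small extra cost of answering 'false'

# 'some': positive literal meaning, explicitly negative inference.
# Two-sided responders must wait for the slow negative inference -> B&N effect.
some_literal = POS
some_two_sided = EXPL_NEG + FALSE_DELAY
assert some_two_sided > some_literal

# 'scarce': implicitly negative literal meaning, positive inference.
# The fast positive inference offsets the 'false' delay -> roughly equal times.
scarce_literal = IMPL_NEG
scarce_two_sided = POS + FALSE_DELAY
assert abs(scarce_two_sided - scarce_literal) < 0.5

# 'not all': explicitly negative literal meaning, positive inference.
# 'False' can be given as soon as the fast inference fails -> reverse effect.
notall_literal = EXPL_NEG
notall_two_sided = POS + FALSE_DELAY
assert notall_two_sided < notall_literal
```

Under these assumptions, the three qualitative patterns (B&N effect, no effect, reverse effect) fall out of the same parallel-verification mechanism.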
In this study, we test the hypothesis that only positive scalar words give rise to the B&N effect in a more comprehensive and systematic way by investigating the processing of 16 adjectival scalar words of both positive and negative polarity. Rather than relying on one subjective diagnostic to classify scalar words in a binary way as either positive or negative, we combined the outcomes of four objectively measurable diagnostics for polarity to obtain a gradient measure of polarity. We then tested whether this gradient measure of polarity predicted the presence or absence of a B&N effect. In the next section, we describe our study in more detail.

Our study
Our study tested 16 adjectival scales: ⟨ajar, open⟩, ⟨breezy, windy⟩, ⟨chubby, fat⟩, ⟨content, happy⟩, ⟨cool, cold⟩, ⟨drizzly, rainy⟩, ⟨fair, good⟩, ⟨low, empty⟩, ⟨mediocre, bad⟩, ⟨passable, good⟩, ⟨ripe, overripe⟩, ⟨scarce, absent⟩, ⟨sleepy, asleep⟩, ⟨unlikely, impossible⟩, ⟨warm, hot⟩, and ⟨youthful, young⟩. For each scale, we constructed a simple sentence containing the weaker scalar word, and, for each sentence, we created three images: a target image where the sentence was literally true but where its scalar inference was false, and two control images where the sentence was unambiguously true or false. See the Appendix for an overview of the sentences and images that we tested.
Participants in the experiment first saw the sentence. Once they finished reading the sentence, they pressed the space bar to see the image. At that point, they had to indicate whether they felt the sentence they had just read was a good or bad description of the corresponding image.
We measured their verification times (i.e., the time between image onset and the point at which one of the response buttons was pressed) to establish the presence or absence of a B&N effect, i.e., to determine whether or not verification times were slower for 'false' than for 'true' answers in the target condition, vis-à-vis the control condition.
To test the polarity hypothesis (i.e., the idea that only positive scalar words give rise to the B&N effect), we had to determine the polarity of the scalar words in our study. Here, we focus on the stronger word on the scale, since the polarity-based explanation crucially makes reference to the polarity of the negated alternative, though if the literature is right, all words on a scale should share the same polarity (i.e., the scalarity constraint, cf. Fauconnier 1975; Horn 1989). In the previous section, we discussed five diagnostics that can be used to determine the polarity of adjectives:
i. Scalarity: Positive words denote a lower bound on their measurement scale.
ii. Questions: Positive adjectives are neutral in 'how' questions.
iii. Ratio: Positive adjectives are more felicitous in ratio phrases like 'twice as'.
iv. Valence: Positive words are judged as having more positive connotations.
v. Frequency: Positive words are more frequent than negative ones.
The first three diagnostics reflect the linguistic notion of polarity as markedness; the last two diagnostics reflect the psychological notion of polarity as subjective valence.
Van Tiel and colleagues focused on the scalarity criterion in their study. However, in our study, we wanted to avoid using this criterion for two reasons. First, not all of the scalar words that we tested make reference to a clearly identifiable measurement scale. For example, in the case of 'open', it is unclear whether the underlying measurement scale is about openness or closedness. Second, and relatedly, the scalarity criterion crucially relies on researchers' intuitions, which are not always reliable.
Rather than relying on one specific construal of polarity, or even one specific diagnostic measure, we made use of each of the remaining four diagnostics in the list. Unlike the scalarity criterion, these four diagnostics can be operationalised using objective data. We assume here that each of the four diagnostics offers an approximation of a fundamental latent construct of polarity, and that, by combining these diagnostics, we are able to obtain a relatively reliable estimate of that construct. Crucially, we assume that polarity is gradient rather than binary; that is, words can be positive or negative to varying degrees. In making this decision, we do not want to question the value of focusing on one specific construal of polarity, e.g., in terms of markedness. However, even in the extensive line of work focusing on linguistic polarity, it has been observed that there is no fail-proof way of establishing the polarity of a word that consistently accords with linguists' intuitions (e.g., Rett 2008; Sassoon 2010; Gotzner et al. 2018). By combining data from different diagnostics, we may mitigate potentially counterintuitive outcomes from any single diagnostic.
Hence, for each of the stronger scalar words in our experiment, as well as their corresponding antonyms, we obtained four measures, corresponding to the last four diagnostics in the list above: (i) their frequency in the phrase 'how [adjective]', (ii) their frequency in the phrases 'twice as [adjective]' and 'half as [adjective]', (iii) their valence ratings as reported by Mohammad (2018), and (iv) their overall frequency. The corpus counts for (i), (ii), and (iv) were taken from the ENCOW16A corpus (Schäfer & Bildhauer 2012; Schäfer 2015), and the counts for (i) and (ii) were relativised to the adjectives' overall frequency. The corpus frequencies were always logarithmised as a way of reducing skewness. The outcome of each measure for the stronger scalar word was divided by the outcome of that measure for its antonym. Thus, values greater than 1 indicate that the stronger scalar word was positive relative to its antonym; values between 0 and 1 indicate the reverse. The resulting ratio values are provided in Table 3. To combine the four measures into a single gradient polarity score, we carried out a principal component analysis and used each word's value on the first principal component.

We used the values from the first principal component to test the polarity hypothesis, which predicts that the B&N effect should interact with polarity. Recall that the B&N effect consists of slower verification times when responding 'false' than 'true' in the target condition vis-à-vis the control condition; i.e., the B&N effect consists in an interaction effect on verification times between condition (target vs. control) and response ('true' vs. 'false'). Hence, the polarity hypothesis predicts a significant three-way interaction between condition (target vs. control), response ('true' vs. 'false'), and polarity, such that the relative increase in verification times for 'false' compared to 'true' responses in the target condition increases with increasing polarity of the adjective.
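The combination step can be sketched as follows. The ratio values below are made up for illustration (the real values come from ENCOW16A counts and Mohammad's valence norms), and the sketch assumes standardised measures; since the sign of a principal component is arbitrary, in practice the orientation would be fixed against a word of known polarity.

```python
import numpy as np

# Hypothetical ratio values (stronger word / antonym) on the four diagnostics:
# 'how' questions, ratio phrases, valence, overall frequency.
words = ["empty", "absent", "open", "young"]
ratios = np.array([
    [0.8, 0.7, 0.9, 0.6],   # 'empty': mostly below 1 -> relatively negative
    [0.5, 0.6, 0.4, 0.7],   # 'absent'
    [1.4, 1.6, 1.3, 1.5],   # 'open': mostly above 1 -> relatively positive
    [1.2, 1.1, 1.6, 1.3],   # 'young'
])

# Standardise each diagnostic, then project onto the first principal
# component to obtain a single gradient polarity score per word.
z = (ratios - ratios.mean(axis=0)) / ratios.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
polarity = z @ vt[0]

# Words at opposite ends of the diagnostics receive scores of opposite sign
# (up to a global sign flip inherent to PCA).
assert polarity[words.index("empty")] * polarity[words.index("open")] < 0
```

Because the four diagnostics are expected to correlate, the first component typically captures most of the shared variance and serves as the gradient polarity measure.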
The next section describes our experiment in more detail, followed by the results. All data and analysis files can be accessed at https://osf.io/wxmeq/.

Materials
As mentioned above, the materials consisted of 16 adjectival scales. For each scale, we created a simple sentence containing the weaker scalar word. For each sentence, we created three types of images: one image where the sentence was unambiguously true, one image where it was unambiguously false, and one image where the sentence was true on its literal interpretation but false if its scalar inference was derived. Table 3 shows the lexical scales that were tested; the Appendix provides the sentences and images used in the experiment.
The materials were pretested in two experiments with 25 participants each. Based on these pretests, we made several adjustments to the sentences and images to ensure that participants responded as expected, i.e., rejected the sentence in the false control condition, accepted it in the true control condition, and vacillated between accepting and rejecting the sentence in the target condition. (Here, vacillation was defined as significantly fewer 'true' responses than in the true control condition and significantly fewer 'false' responses than in the false control condition.) The experiment presented each sentence-image pair three times, and thus comprised 16×3×3 = 144 trials in total. The order of presentation was randomised for each participant.
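The resulting design can be illustrated with a minimal sketch; the scale names are placeholders (the actual sentences and images are listed in the Appendix):

```python
import itertools
import random

scales = [f"scale_{i}" for i in range(16)]   # placeholders for the 16 adjectival scales
conditions = ["true_control", "false_control", "target"]

# Each of the 16 x 3 sentence-image pairs is presented three times,
# in an order randomised anew for each participant.
trials = list(itertools.product(scales, conditions)) * 3
random.shuffle(trials)
```

This yields the 16 × 3 × 3 = 144 trials per participant described above.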

Procedure
Participants were instructed to indicate whether or not the sentence was a good description of the image. They could register their judgement by pressing either '1' (to answer in the positive) or '0' (to answer in the negative) on their keyboard. Trials started with the presentation of the sentence. Upon pressing the space bar, the sentence was replaced by the image, whereupon participants could give their truth judgements. We measured the time from image onset to button press.

Data treatment
Three participants were removed from the analyses because their accuracy on control items was below 80%. In addition, we removed trials with a verification time faster than 200 milliseconds or slower than 10 seconds, assuming that these correspond to accidental button presses or inattentiveness to the task at hand. This resulted in the removal of 14 trials (less than 0.1% of the data). Figure 1 shows the percentages of 'true' responses for each scalar word and condition. Performance in the control conditions was close to ceiling. In the 'true' control condition, performance ranged from 87% for 'warm' to 100% for 'sleepy' and 'content'. In the 'false' control condition, performance ranged from 86% for 'unlikely' to 100% for 'ajar' and 'youthful'.
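As a sketch, the exclusion procedure amounts to the following two filtering steps, here applied to a handful of invented trial records with hypothetical field names:

```python
# Hypothetical trial records; 'rt' is the verification time in milliseconds.
trials = [
    {"participant": "p1", "rt": 150,   "condition": "control", "correct": True},
    {"participant": "p1", "rt": 850,   "condition": "control", "correct": True},
    {"participant": "p2", "rt": 12000, "condition": "target",  "correct": False},
    {"participant": "p2", "rt": 950,   "condition": "control", "correct": True},
]

# Step 1: drop participants whose accuracy on control items is below 80%
acc = {}
for t in trials:
    if t["condition"] == "control":
        hits, n = acc.get(t["participant"], (0, 0))
        acc[t["participant"]] = (hits + t["correct"], n + 1)
kept_participants = {p for p, (hits, n) in acc.items() if hits / n >= 0.8}

# Step 2: drop trials outside the 200 ms - 10 s verification-time window
kept = [t for t in trials
        if t["participant"] in kept_participants and 200 <= t["rt"] <= 10000]
```

In this toy example, both participants pass the accuracy criterion, but the 150 ms and 12,000 ms trials are excluded by the time window.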

Results
In the target condition, the percentage of 'true' responses ranged from 16% for 'mediocre' to 80% for 'youthful'.
One of our reviewers rightly observed that the percentage of 'true' responses in the target condition for 'mediocre' (16%) was so close to the percentage of errors in the 'false' control condition (11%) for that scalar word that one might call into question our assumption that the 'not bad' inference is a bona fide scalar inference rather than being an aspect of the lexical meaning of 'mediocre'. However, given that 'mediocre' behaved in line with our assumption in the pretest (26% 'true' responses in the target condition vs. 7% errors in the 'false' control condition), we decided to retain this item in our analyses. We want to emphasise that removing 'mediocre' does not have any noteworthy consequences for the analyses that we report below.
Next, we considered participants' verification times. Figure 2 shows the mean logarithmised verification times for aggregated positive and negative scalar words. Here, we classified a scalar word as positive if its sign on the first principal component was positive; otherwise as negative (cf. Table 3). Figure 2 shows the B&N effect for positive scalar words, but not for negative ones. Note that, as discussed previously, in the statistical analyses, we treated polarity as a continuous rather than binary factor; this visualisation is only intended to summarise our continuous polarity measure. Figure 3 shows the mean logarithmised verification times for each scalar word separately.
To test the polarity hypothesis, we constructed a linear mixed-effects regression model predicting logarithmised response times based on condition (target vs. control), response ('true' vs. 'false'), polarity, and their interactions, including random intercepts for participants and scalar words. For all analyses, degrees of freedom and corresponding p-values were estimated using the Satterthwaite procedure, as implemented in the 'lmerTest' package (Kuznetsova et al. 2013). The model also included response bias (i.e., the proportion of 'true' responses in the target condition) and trial number (linearly rescaled to the [0, 1] interval) as main effects. A response bias could lead to slower non-modal responses; trial number was included to capture learning effects. The model showed a highly significant interaction between condition, response, and polarity (b = 0.09, SE = 0.01, t = 6.00, p < .001). This interaction confirmed that polarity modulated the presence or absence of the B&N effect. There was no significant effect of response bias (b = 0.02, SE = 0.02, t < 1); however, there was a significant effect of trial number (b = -0.50, SE = 0.02, t = 32.46, p < .001), indicating that participants responded increasingly rapidly throughout the experiment.
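The structure of this model can be sketched in Python on simulated data. Note that this is a simplified illustration: the published analysis was run with lme4/lmerTest in R and additionally included crossed random intercepts for scalar words and a response-bias covariate, whereas statsmodels supports only a single grouping factor; all variable names and values below are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
pid = rng.integers(0, 20, n)                     # 20 simulated participants
df = pd.DataFrame({
    "participant": pid,
    "condition":  rng.choice(["control", "target"], n),
    "response":   rng.choice(["true", "false"], n),
    "polarity":   rng.normal(size=n),            # continuous polarity score
    "trial":      rng.uniform(0, 1, n),          # trial number rescaled to [0, 1]
})
# Simulated log verification times with a per-participant offset
df["log_rt"] = 7 + rng.normal(0, 0.3, 20)[pid] + rng.normal(0, 0.2, n)

# Fixed effects: the full condition x response x polarity interaction plus
# trial number; random intercepts for participants. The polarity hypothesis
# corresponds to the three-way interaction term being significant.
model = smf.mixedlm("log_rt ~ condition * response * polarity + trial",
                    df, groups=df["participant"])
fit = model.fit()
print(fit.summary())
```

On real data, the coefficient of interest is the condition-by-response-by-polarity term, whose significance would indicate that the B&N effect grows with the polarity score.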
To obtain a more fine-grained picture of the scope of the polarity hypothesis, we checked, for each scalar word separately, whether a B&N effect was present or not. To this end, for each scalar word, we constructed a linear mixed-effects regression model predicting logarithmised response times on the basis of condition, response, their interaction, and trial number, including random intercepts for participants with random slopes for condition and response (but not their interaction). We observed significant interaction effects for 'content', 'fair', 'passable', 'ajar', 'chubby', 'warm', and 'youthful' (see Table 4 for the full model parameters). Taken together, these results confirm the polarity hypothesis: the presence or absence of a B&N effect was modulated by the polarity of the scalar words. More specifically, all of the positive scalar words gave rise to a B&N effect, while almost none of the negative scalar words did. The only exception to this rule was 'youthful', which was associated with a B&N effect despite being assigned a (slightly) negative polarity value (-0.85).

Summary
Pragmatic theories make conflicting predictions about the processing of scalar inferences in out-of-the-blue contexts. Relevance theory predicts that, in such contexts, the literal interpretation should be easier to retrieve than an interpretation that is enriched with a scalar inference. By contrast, Levinson (2000) predicts that it is the literal interpretation that should incur a processing cost, since it involves overturning the default enriched interpretation.
In a seminal study, Bott & Noveck (2004) found that the computation of the scalar inference of 'some' implying 'not all' was associated with increased sentence verification times. This B&N effect seems to provide strong evidence for the relevance-theoretic idea that the derivation of scalar inferences without a facilitating context is cognitively costly. However, more recent studies observed that the B&N effect does not consistently generalise beyond the ⟨some, all⟩ scale, which raises the question of whether the B&N effect is really caused by the processing of the scalar inference (e.g., Chevallier et al. 2010; Romoli & Schwarz 2015; van Tiel et al. 2019b).
To explain these findings, van Tiel and colleagues (2019b) hypothesised that the presence or absence of a B&N effect depends on the polarity of the scalar word.
The polarity-based explanation proceeds from the observation that verification times vary with the polarity of the sentence (e.g., Clark & Chase 1972; Carpenter & Just 1975). In particular, positive sentences are verified faster than sentences containing an implicitly negative word (e.g., 'low'), and the latter are verified faster than sentences containing an explicit negation (e.g., 'not all'). Correspondingly, given that all words on a scale have the same polarity, there are essentially three options: the words on a lexical scale may be positive, inherently negative, or explicitly negative. Hence, given that scalar inferences consist in the negation of the stronger scalar word, the polarity of the scalar inference is either negative (for positive scalar words) or doubly negative (for inherently or explicitly negative scalar expressions). Crucially, the polarity-based explanation argues that the latter are easier to verify than the former.
Various explanations for this potentially controversial assumption may be given. First, it could be that certain propositions containing a double negation are in fact easier to process than propositions with a single negation. Second, it could be that participants eliminate the double negation on the fly as they encounter it. Both of these explanations can be empirically tested by asking participants to verify the doubly negated scalar inferences (e.g., 'The battery is not empty') and comparing the verification times with those for the literal interpretation of the target sentence (i.e., 'The battery is low'). If either of the foregoing explanations is on the right track, 'false' responses to the negated scalar inference should be at least as fast as 'true' responses to the target sentence.
However, concerning the former explanation, prior research has shown that, in many cases, verification times increase with the number of negations (Sherman 1976). Hence, from a psychological perspective, perhaps the most plausible explanation is the latter: the positive form of the scalar inference is directly associated with its triggering expressions. According to this explanation, from a cognitive standpoint, the derivation of scalar inferences does not involve nonce construction and negation of alternatives, but rather resembles a form of disambiguation (cf. also Marty & Chemla 2013 for relevant comments about the parallelism between scalar inferencing and disambiguation).
If, finally, we assume that participants who arrive at a two-sided interpretation verify the literal interpretation and the scalar inference in parallel, the correct predictions follow straightforwardly: positive scalar words give rise to the B&N effect, inherently negative scalar words do not, and explicitly negated scalar words lead to the reverse B&N effect.

We extensively and systematically tested this polarity-based explanation by comparing the processing of 16 adjectival scalar inferences using a sentence-picture verification task. We estimated the polarity of the scalar words in our sample (and, hence, of the corresponding scalar inferences) on the basis of four diagnostics measuring their linguistic markedness and psychological valence. We found that the presence or absence of a B&N effect was strongly dependent on the polarity of the lexical scale. Indeed, of the 7 lexical scales whose inferences led to increased verification times, 6 were estimated to be positive. The sole exception was ⟨youthful, young⟩, which was associated with a B&N effect despite being classified as (somewhat) negative.
One interesting observation for this particular scale is that the valence criterion "correctly" classified this scale as positive rather than negative (i.e., in accordance with the behaviour of 'youthful' in the experiment, and in contrast to its estimated polarity). Hence, one may jump to the conclusion that the valence criterion generally offers a better measure of polarity than the other diagnostics. However, this does not hold true across the board. For example, 'rainy' also had positive valence ratings relative to 'dry', but the ⟨drizzly, rainy⟩ scale was not associated with a processing cost.
Another scalar word that merits some further discussion is 'unlikely'. 'Unlikely' was the only scalar word in our sample that was explicitly marked for negativity by means of the negative prefix 'un-'. Hence, one might expect to find a reverse B&N effect for this particular scalar word, i.e., one might expect that it patterns with explicitly negative scalar constructions like 'not all' rather than with implicitly negative scalar words like 'low'. This prediction was not borne out. However, on closer inspection, this finding is not so surprising. To explain, consider the hierarchy of negation proposed by Fodor and colleagues (1975). According to Fodor and colleagues, negativity may have various sources. Ranging from "most negative" to "least negative", these are as follows:
i. Explicitly negative free morphemes (e.g., 'not').
ii. Explicitly negative bound morphemes (e.g., the prefix 'un-' in 'unlikely').
iii. Implicitly negative free morphemes (e.g., 'low').
iv. Free morphemes that are defined in negative terms (e.g., 'bachelor' meaning someone who is not married, or 'kill' meaning causing someone to not be alive).
'Unlikely' is of class ii, whereas the other negative scalar words that we tested are all of class iii. Crucially, however, in terms of cognitive processing, Fodor and colleagues report that negative words of class ii pattern with implicitly negative words from class iii, rather than with the explicitly negative ones from class i.
Taken together, then, it is clear that there is a strong connection between polarity and the B&N effect. Perhaps most forcefully, it seems difficult to explain why the scalar inference of 'warm' but not 'cool' was associated with a processing cost without appealing to the notion of polarity, especially given that the sentences and images used for these scalar words were so similar (the images showed transparent drinking glasses containing water at different temperatures, from a block of ice to vigorously boiling, cf. Appendix). Hence, we view our results as strong support for the polarity-based explanation.

Processing scalar inferences
Crucially, if the polarity-based explanation is correct, the classic observation that certain scalar inferences lead to increased verification times is not reflective of any processing cost for scalar inferencing, but rather reflects the psychological difficulty of verifying negative information.
Indeed, the polarity-based explanation leads us to conclude that scalar inferencing itself is not associated with a processing cost, even in the absence of a facilitating context, contra, e.g., relevance theory. At the same time, however, our results also fail to support the defaultist idea that scalar inferences arise temporally prior to the literal interpretation; nowhere did we observe faster processing times when people computed the scalar inference.
Rather, it seems that scalar inferences are conventionally or statistically associated with their triggering expressions. That is, when hearing an utterance containing the weaker scalar word, people may immediately activate the corresponding scalar inference at no processing cost. However, when verifying this scalar inference, a processing cost may ensue if the scalar inference is negative, since negative information takes longer to be verified (e.g., Clark & Chase 1972; Carpenter & Just 1975).
One might suppose that this explanation is not very "Gricean" in spirit. Note, however, that Grice (1975) himself acknowledged the possibility that conversational implicatures are "intuitively grasped" (p. 50). What is crucial for Grice is whether or not this intuition is in principle "replaceable by an argument" (ibid.) that takes the literal meaning of the utterance and the assumption of cooperativity as its premises (see also Geurts & Rubio-Fernández 2015). However, such arguments should not necessarily be construed as psychologically real, i.e., hearers presumably do not actually construct such an argument every time they encounter a scalar word. Rather, the Gricean calculations provide a rational grounding of the inferences that hearers are entitled to derive (Kissine 2016; Geurts 2019; Dänzer 2020).
Our proposal makes some empirically testable predictions. In particular, it is predicted that other experimental paradigms that make use of sentence verification should also be susceptible to the polarity effect: if the proposition to be verified contains negative information, its verification should be cognitively costly. One such paradigm that relies on sentence verification was introduced by De Neys & Schaeken (2007). Their study essentially mirrored Bott and Noveck's (2004: Exp. 3) in that participants gave intuitive truth judgements to underinformative sentences like (14). Crucially, however, De Neys and Schaeken required participants to memorise dot patterns of varying complexity during the process of sentence verification.
De Neys and Schaeken found that participants were less likely to respond 'false', i.e., to derive the scalar inference, when they had to memorise complex dot patterns compared to simple ones (cf. also De Neys & Schaeken 2007; Dieussaert et al. 2011; Marty & Chemla 2013; van Tiel et al. 2019a; Cho 2020; Marty et al. 2020). We refer to this finding as the D&S effect.
De Neys and Schaeken explain the D&S effect based on the premise that the derivation of scalar inferences is associated with a processing cost. According to their explanation, participants who had to memorise complex dot patterns had fewer cognitive resources available to derive the scalar inference, and consequently were less likely to carry out the derivation process. However, if the polarity-based explanation is on the right track, the D&S effect could also be modulated by the polarity of scalar words.
Interestingly, van Tiel and colleagues (2019b) also carried out a working memory load experiment for the same seven lexical scales that they tested in the sentence picture verification task that we discussed in the introduction. They found that, whereas all positive scalar words were associated with a D&S effect, the two negative scalar words in their sample-i.e., 'low' and 'scarce'-were not. This observation suggests that the D&S effect is also susceptible to polarity in the same way as the B&N effect.
However, in a more recent study, Marty and colleagues (2020) provide provocative evidence against this conclusion by showing that the D&S effect is attested for indirect scalar inferences. For example, they found that, in displays showing only green apples, participants were less likely to reject sentences such as (15) when they had to memorise complex patterns than when they had to memorise simple ones.
(15) Not all of the apples are red.
To explain this pattern of results, Marty and colleagues (following Marty & Chemla 2013) argue that a processing cost emerges when participants make the cognitively costly decision to go beyond the literal interpretation. However, in line with the polarity-based explanation, they also hold that the process of deriving the scalar inference (i.e., the construction and rejection of alternatives) proceeds without any processing cost.
These findings paint a complex picture that obviously calls for a more detailed discussion than we can offer here, but they clearly show that, for other measures, too, there has been debate about whether the locus of the alleged processing cost is in the process of scalar inferencing or in some other relevant cognitive process (e.g., Marty 2018; Sun & Breheny 2019). This contribution fits into that line of work in showing that polarity is one of the factors influencing verification times in sentence verification tasks.
It is an open question whether polarity also influences other measures of processing cost, e.g., those involving reading times and eye movements. There is at least some evidence indicating that these measures, too, are influenced by the polarity of a sentence. For example, Glenberg et al. (1999) report longer reading times for negative sentences. Similarly, Tian et al. (2010) report that eye fixations to the correct image in a visual world paradigm are delayed for negative sentences compared to positive ones. However, both of these studies focus on sentences containing the explicit sentential negation 'not', rather than the implicitly negative words that we studied here. Hence, as it stands, it is unclear whether polarity has any explanatory role for experimental studies on the processing of scalar inferences using reading times and eye-tracking. Future work will have to determine whether polarity has pervasively influenced the experimental literature on scalar inference processing, or whether the effect is restricted to sentence verification as we have tacitly assumed throughout this paper.

Polarity
To estimate the polarity of the scalar words in our sample, we combined insights from linguistics and psychology. In linguistics, polarity is usually construed in terms of markedness; in psychology, in terms of subjective valence. We obtained measures of both construals, and used those to estimate a hypothesised latent construct of polarity. Here, we depart from (and hopefully improve on) prior research, particularly in linguistics.
Much linguistic research is premised on the idea that polarity is a binary notion, i.e., in any antonym pair, one member is positive and the other negative (e.g., Ruytenbeek et al. 2017; Gotzner et al. 2018). This approach regularly leads to an aporia. For example, Sassoon (2010) sought to determine polarity by looking at the frequency of antonym pairs in the 'twice as [adjective]' frame. In line with her hypothesis, intuitively positive adjectives tended to be more frequent in such frames than negative ones. However, there were ample exceptions. Thus, Sassoon found that 'twice as bad' was more frequent than 'twice as good', although she intuited that 'good' is positive and 'bad' is negative.
We observed many similar conflicts between diagnostics (cf. Table 3). For example, in Mohammad's (2018) subjective valence study, 'rainy' was rated as more positive than 'dry', suggesting that 'rainy' is positive and 'dry' negative. However, the other diagnostics suggested the opposite conclusion. As one of our reviewers suggested, the "aberrant" outcome in terms of subjective valence could be due to the fact that 'dry' is highly polysemous, and that its subjective valence depends on which meaning is selected. For example, dry weather is generally considered positive, while dry bread is clearly negative. Hence, it could be that participants in Mohammad's study tended to construe 'dry' in the latter negative sense rather than the former.
One way of resolving such clashes between diagnostics is by incorporating the results of multiple diagnostics into the estimation of a gradient measure of polarity. We hope this approach finds a following in linguistic and psychological research on polarity.

Conclusion
Perhaps above all, our results emphasise the importance of testing broader samples of scalar words in research on scalar inferences-not just to determine whether psychological effects generalise across the entire family of scalar words, but also because it offers an insight into the structural linguistic constructs that underlie language processing. Thus, our study has shown the psychological relevance of the notion of polarity. We hope our research will inspire others to revisit the many interesting findings that have been reported on the scalar inference of 'some' to see if they generalise and, if not, what factors may explain the observed scalar diversity, so that we may come to a better understanding of the cognitive processes that underlie the derivation of scalar inferences-and ultimately pragmatic inferences more generally.