Abstract knowledge versus direct experience in processing of binomial expressions

We ask whether word order preferences for binomial expressions of the form A and B (e.g. bread and butter) are driven by abstract linguistic knowledge of ordering constraints referencing the semantic, phonological, and lexical properties of the constituent words, or by prior direct experience with the specific items in question. Using forced-choice and self-paced reading tasks, we demonstrate that online processing of never-before-seen binomials is influenced by abstract knowledge of ordering constraints, which we estimate with a probabilistic model. In contrast, online processing of highly frequent binomials is primarily driven by direct experience, which we estimate from corpus frequency counts. We propose a trade-off wherein processing of novel expressions relies upon abstract knowledge, while reliance upon direct experience increases with increased exposure to an expression. Our findings support theories of language processing in which both compositional generation and direct, holistic reuse of multi-word expressions play crucial roles. © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
When we encounter common expressions like I don't know or bread and butter, do we process them word-by-word or do we treat them as holistic chunks? Research on sentence processing has largely focused on how single words are combined into larger utterances, but intuitively it seems that high frequency multi-word expressions might be processed holistically, even if they could in principle be treated compositionally. Recent research has thus questioned what possible sizes of combinatory units should be considered as the building blocks of sentence processing: Must all multi-word expressions be generated compositionally each time they are used, or can the mental lexicon contain holistic representations of some multi-word units?
The primary diagnostic for this question is whether the frequency of occurrence of multi-word expressions is predictive of their behavior in language processing. Such frequency effects are well documented at the level of individual words: more frequent words are faster to read (Inhoff & Rayner, 1986; Rayner & Duffy, 1986; Rayner, Sereno, & Raney, 1996), more likely to be skipped in reading (Rayner & Well, 1996), and more susceptible to phonetic reduction (Bybee, 1999; Gregory, Raymond, Bell, Fosler-Lussier, & Jurafsky, 1999). But do comparable frequency effects exist for multi-word expressions, when the frequency of their component words is controlled for? If the frequency of a given expression is being mentally stored, this implies that there is a mental representation of the expression as a whole. In contrast, if there are no frequency effects at the level of multi-word expressions, this is evidence against them having holistic representations akin to those of individual words.
A traditional view of grammar does not include holistic representations of multi-word expressions. According to this view, there is a strict separation between the individual words of a language and the rules for combining them. Pinker (2000), for example, describes a "traditional words-and-rules theory" in which "there are two tricks, words and rules. They work by different principles, are learned and used in different ways, and may even reside in different parts of the brain." (See also Ullman, 2001; Ullman et al., 2005.) One tenet of this theory is that forms which can be generated compositionally are not stored: for instance, in the case of the English past tense, irregular forms are stored, while regular forms are generated anew using the -ed suffix each time they are used (Pinker, 1991). It remains possible within this theory that some regular forms, particularly extremely high frequency ones, may be stored as well, but this is not the general method for dealing with such forms. As Pinker (2000) argues, the motivation for this theory is memory constraints on the representation of language knowledge: it is more efficient to store a single, widely applicable rule than to store each regular form individually.
In a similar vein, this theory predicts that multi-word expressions should not be stored holistically because they can be generated compositionally, except in the case of non-compositional exceptions such as idioms (Swinney & Cutler, 1979). Again, as with regularly inflected wordforms, some exceptions may exist, but the exponentially larger number of multi-word expressions with which people have experience makes it even less likely that these expressions would be stored holistically, given the motivating concern with storage efficiency. The words-and-rules theory thus does not predict that the processing of a multi-word expression will be affected by the frequency of the expression as a whole, though it can be affected by the frequencies of the individual words making up the expression.¹
In contrast, there exists a growing movement of grammatical theories that do not draw a sharp distinction between the lexicon and the combinatory rules (e.g. Baayen, Milin, Durdevic, Hendrix, & Marelli, 2011; Bybee, 2001, 2006; Gahl & Yu, 2006; Goldberg, 2003; Hay & Bresnan, 2006; Johnson, 1997, 2006; Langacker, 1987; Pierrehumbert, 2000; van den Bosch & Daelemans, 2013). Rather than conceiving of rules as static entities dissociated from the lexicon, these usage-based approaches instead conceive of rules as dynamically generated generalizations over one's linguistic experience. In particular, many of these approaches (notably Bybee, 2001; Hay & Bresnan, 2006, among others) claim that people mentally store exemplars, or tokens of linguistic experience, which can be larger than single words. Language users then form generalizations from exemplars at multiple levels of granularity (e.g., morpheme, word, or phrase) simultaneously, and the resulting network of generalizations constitutes our grammatical knowledge. Single words and multi-word expressions are thus on an equal footing: both are possible units that can be inferred from exemplars, and frequencies of multi-word expressions are predicted to be stored and tracked just as frequencies of single words are.
Similar claims are made by exemplar-based computational models, which, like the exemplar-based grammatical theories, can incorporate combinatorial units of varying sizes from morphemes to sentences (e.g. Bod, 1998, 2008; Bod et al., 2003; Johnson, Griffiths, & Goldwater, 2007; O'Donnell, Snedeker, Tenenbaum, & Goodman, 2011; Pierrehumbert, 2000; Post & Gildea, 2013). Within these models, the process of learning a grammar is explicitly one of deciding what sizes of units are most applicable or probable to explain the available language data. Under the learned grammars, many utterances can be parsed in multiple ways, either as combinations of individual words, or as holistic expressions, or various combinations thereof.
Evidence for these usage-based theories in the domain of multiword expressions comes in large part from previous demonstrations of phrase-level frequency effects. Bybee (2006) reviews numerous corpus analyses demonstrating that the frequency of multi-word expressions is predictive of phonological reduction, grammaticalization, and other properties of usage, with a focus on highly frequent expressions such as I don't know or going to. Frequency effects for multi-word expressions have also been demonstrated in a controlled experimental setting: in a phrasal-decision task (analogous to a lexical decision task), Arnon and Snider (2010) found that more frequent phrases-e.g. Don't have to worry-were judged to be sensible phrases of English faster than less frequent phrases matched for word and substring frequencies-e.g. Don't have to wait. They further demonstrate that these effects exist across a wide range of frequencies, not just at the highest end of the frequency spectrum. (For a comparable finding using phonetic duration in corpus data, see Arnon & Cohen Priva, 2013. Similar frequency effects have also been found in child language acquisition; see Bannard & Matthews, 2008.) The exemplar-based approach also accords with more recent work on idioms, which challenges the traditional notion of idioms as strictly non-compositional. Gibbs (1990) and Nunberg, Sag, and Wasow (1994) argue that many idioms can be seen as conventionalized metaphoric extensions of their literal meanings, and thus need not be treated as exceptions to the prevailing rules. (Similarly, see Holsinger, 2013.) On the whole, we thus see a broad shift towards recognizing that many expressions reside in a grey zone between entirely compositional and entirely non-compositional, and furthermore that an expression may be conventionalized while still being at least somewhat compositional.
But there remain open questions regarding these exemplar-based approaches and the interpretation of frequency effects for multi-word expressions. One limitation in the work to date is that it is difficult to differentiate the effects of language experience per se from the effects of real-world knowledge. Bybee (2006), for example, stresses the importance of language experience: "As is shown here, certain facets of linguistic experience, such as the frequency of use of particular instances of constructions, have an impact on representation that we can see evidenced in various ways. . ." However, much of her cited evidence conflates linguistic experience with real-world experience. For example, in the phonological reduction of extremely frequent phrases such as I don't know, is this reduction due to the frequency of the linguistic expression per se, or is it due to the frequency of the event of not knowing something? Similarly, in the case of Arnon and Snider's contrast between phrases such as Don't have to worry and Don't have to wait, there could be a difference in the real-world likelihood of the events described by these expressions, which causes faster processing due to the difference in conceptual predictability, as opposed to linguistic predictability.² In general, this confound between linguistic experience and real-world knowledge exists whenever one compares expressions describing different real-world events.
Another outstanding question is how to empirically measure the trade-off between the reuse of stored multi-word expressions and the compositional generation of expressions. In the case of novel or infrequently attested expressions, we assume that such expressions must be processed compositionally using abstract linguistic knowledge, that is, generalized knowledge that is not bound to specific lexical items or expressions. In the case of frequently attested expressions, two potential processing strategies exist: compositional generation or reuse of stored holistic representations. Previous experimental work has primarily focused on the question of whether there is any reuse of stored multi-word expressions, and has suggested that there is at least some, but it remains possible that even very frequent and conventionalized multi-word expressions could in part or at times also be generated anew using abstract knowledge. Thus the major question now is to what extent both holistic reuse and compositional generation play a role in language processing (Wiechmann, Kerz, Snider, & Jaeger, 2013). As mentioned above, computational models have attempted to address this question by simulating what combination of linguistic units of varying sizes most parsimoniously predicts corpus data (Bod et al., 2003; O'Donnell et al., 2011; Post & Gildea, 2013). But there has been no attempt so far to directly measure the competing influences of reuse and generation via behavioral experimentation.
¹ It may be possible to accommodate frequency effects for multi-word expressions under this theory, depending upon further details of the parser. In particular, processing of later words in an expression could be conditioned upon earlier words, thus creating an overall frequency difference. But this is not a direct prediction of the words-and-rules theory.
² Arnon and Snider did attempt to control for this real-world likelihood difference by collecting plausibility ratings for their materials, which they demonstrated did not differ in plausibility between conditions. However, plausibility in all conditions was very high, so extant differences may not have been detected due to ceiling effects.
Our work here does just that: we will quantify the extent to which people's processing of attested expressions is influenced by their frequency of direct experience with those specific expressions versus by the abstract linguistic knowledge that allows them to generate such expressions compositionally. To do so, we need to investigate a linguistic construction for which we can independently estimate people's frequency of direct experience and their abstract knowledge of its composition. Moreover, we want a construction with wide variation in how frequently attested specific instances of the construction are, so that we can measure how the influence of these competing explanations changes as a function of the overall frequency of an expression. For these reasons, an ideal construction is binomial expressions.

Binomial expressions
In this paper, we will address the generation and reuse of multi-word expressions by focusing on binomial expressions of the form A and B, such as bread and butter or sweet and sour. We include in our definition of binomial expressions all potential items with this form, including unattested expressions (e.g. bishops and seamstresses). Although binomial expressions are sometimes taken to include expressions with other conjunctions (e.g. or), here for simplicity we consider only expressions joined with and. Many binomial expressions have a preferred order (e.g. not butter and bread or sour and sweet), but binomials vary in how strong these ordering preferences are: some binomials are entirely fixed in order, or frozen (e.g. safe and sound/*sound and safe), while others are quite free (e.g. television and radio/radio and television). Binomial expressions are thus a case of multi-word expressions that vary along two dimensions: how frequent they are, and how conventionalized their order is.
What causes binomial ordering preferences? One possibility is that preferences arise from abstract linguistic constraints that reference phonological, semantic, or other lexical properties of the elements in a binomial (e.g. the shorter word should come first). An alternate possibility is that preferences are driven by direct experience with the specific binomials in question: an order is preferred because it has been experienced more often.
Binomial expressions thus allow us to study the trade-off between abstract knowledge and direct experience. Specifically, we ask whether ordering preferences for binomial expressions are driven by direct experience with these expressions or by abstract constraints on the order of their elements. Moreover, we ask whether the influence of these two knowledge sources changes as a function of the frequency of an expression.
Additionally, binomial expressions are particularly suitable for studying effects of language experience per se, as opposed to real-world knowledge or other confounds, because the formal syntactic and semantic properties of these expressions are preserved regardless of ordering. Binomial expressions thus have an inherent control condition, unlike Bybee's (2006) investigation of high frequency expressions-whose other potentially relevant linguistic properties (e.g. unigram word frequencies) are not explicitly controlled-or the use of control expressions describing different real-world events by Arnon and Snider (2010, e.g., Don't have to worry vs. Don't have to wait). We can thus study the effects of direct linguistic experience on binomial expressions by manipulating their ordering while minimizing confounds.

Previous work on binomial ordering preferences
Siyanova-Chanturia, Conklin, and van Heuven (2011) demonstrated online effects of binomial ordering preferences: in an eye-tracking study, participants read common binomial expressions in either their preferred or dispreferred order, embedded in sentence contexts, e.g.:
(1) John showed me pictures of the bride and groom both dressed in blue.
(2) John showed me pictures of the groom and bride both dressed in blue.³
Expressions were read faster in their preferred order. Is this reading time difference due to the frequency of people's direct experience with these specific expressions or to their abstract knowledge of constraints on binomial ordering? It has long been known that at least in certain contexts, binomial ordering preferences are sensitive to a variety of semantic, phonological, and lexical constraints, but the degree to which these constraints apply in online processing remains unclear. Early work portrayed these constraints as contributing to the diachronic longevity of expressions, while more recent work has suggested, albeit inconclusively, that such constraints play a role online as well.
Much of the existing work on binomial ordering preferences relies upon corpus analyses or analyses of hand-selected examples. Malkiel (1959) was the first to propose that the relationship between words in a binomial could contribute to the prominence or longevity of the expression. Based on hand-selected examples of frozen binomials, he proposes a number of constraints on ordering, both semantic and phonological, as well as discussing other possible relationships between words (e.g., rhyming and alliteration). A more extensive study of binomial ordering preferences was carried out by Cooper and Ross (1975), whose work focuses on demonstrating a Me First constraint, which posits that "first conjuncts refer to those factors which describe the prototypical speaker." (This prototypical speaker is later described as "Here, Now, Adult, Male, Positive, Singular, Living, Friendly, Solid, Agentive, Powerful, At Home, and Patriotic, among other things.") They further introduce a number of phonological constraints on ordering, noting that the various constraints seem to differ in strength and may interact with each other, but they do not attempt to quantify these strengths or their interactions. Their investigation is based on a hand-selected sample of common binomial expressions, and they explicitly frame their discussion in terms of constraints that contribute to the diachronic longevity of an expression. Fenk-Oczlon (1989) introduced the idea that these constraints might apply to online processing as well as diachronic language change, arguing that most of Cooper and Ross's proposed constraints could be subsumed under the constraint that "the more frequent and therefore informationally poorer elements tend to occupy initial position" and that this new constraint is motivated by cognitive principles. His argument is supported by corpus data, but he does not provide any evidence from online processing measures.
Similarly, Sobkowiak (1993), again based on corpus data, suggests that most of the previously proposed constraints can be subsumed under a principle of "unmarked-before-marked", which he relates to the information structure principle of "given before new".
More recent work has stopped attempting to unify disparate constraints and has instead focused on determining the relative rankings or weights of different constraints. In particular, Benor and Levy (2006) surveyed a large number of proposed constraints on ordering preferences from the previous literature, and considered a variety of probabilistic modeling frameworks for combining them. They found that a logistic regression model best predicts ordering preferences for a large selection of binomial expressions randomly selected from a corpus. Similarly, Mollin (2012) inferred a hierarchy of constraints from corpus data and found comparable rankings to those found by Benor and Levy. While the existence of binomial ordering constraints in corpus data is well demonstrated, it is unclear whether these constraints apply only diachronically or whether they have synchronic cognitive status. Offline experimental tasks have suggested the synchronic cognitive reality of some constraints, mostly phonological. Using a forced-choice preference task in which subjects choose between possible orders of a binomial expression, Bolinger (1962) demonstrated a preference to avoid having two stressed syllables in a row, comparable to findings in other domains of grammatical encoding (Jaeger, 2006; Lee & Gibbons, 2007). Pinker and Birdsong (1979) used a rating task with nonce words to argue for four phonological constraints, including "Panini's Law" (the shorter word, measured in syllables, should come first; named after a 4th Century B.C. Sanskrit linguist), as well as constraints on vowel quality, vowel length, and initial consonant obstruency. Wright, Hay, and Bent (2005) used a forced-choice preference task to demonstrate that male names preferentially precede female names, even when phonology and frequency are controlled for. Moreover, they showed that male names tend to have "first-position" phonological properties and are on average more frequent than female names.
These offline tasks demonstrate that at least some abstract constraints on ordering are synchronically cognitively active, but they do not demonstrate whether these constraints are available during real-time language processing or whether they are available only upon later reflection.
Prior to Siyanova-Chanturia et al.'s work, a small number of online investigations used recall tasks to simulate language production, with mixed results regarding whether abstract ordering constraints are active in online production. Bock and Warren (1985) did not find effects of concreteness in ordering preferences, although the number of subjects and items in their task is small relative to the numbers we will use. Kelly, Bock, and Keil (1986) and Onishi, Murphy, and Bock (2008) did find effects of prototypicality. McDonald, Bock, and Kelly (1993) found effects of animacy and prosody, but-in contrast to Pinker and Birdsong-not word length. Thus the previous work provides weak evidence for some effects of abstract ordering constraints in production. The existence of such effects in comprehension has yet to be tested.
So based on our current knowledge, it is unclear whether to attribute the processing differences found by Siyanova-Chanturia et al. to the frequency of people's direct experience with these specific expressions or to their abstract knowledge of constraints on binomial ordering. Here we adopt a two-pronged approach to address this question. We look for effects of abstract ordering constraints on novel binomial expressions, thus establishing a baseline for such effects in the absence of direct experience with the binomials in question. Additionally, we compare the processing of these novel expressions with Siyanova-Chanturia et al.'s frequently attested expressions, allowing us to assess the relative roles of abstract knowledge and direct experience in the processing of attested expressions.

Our approach and its predictions
In this section, we describe in more detail the theoretical and methodological approach that we will take to studying binomial expressions. We begin by identifying three variables whose potential effects on processing we want to consider and determining how to quantify each one.

Independent variables of interest
For a word pair (A, B), the first variable we consider is the overall frequency of binomial expressions containing these elements, in other words, the combined frequency of the expressions "A and B" and "B and A". To estimate the overall frequency of people's experience with these expressions, we can obtain frequency estimates from large corpora. Frequency can thus be analyzed as a continuous variable (generally measured in occurrences per million words), although in the current work we will treat it dichotomously (unattested versus frequently attested).
The next variable we consider is the relative frequency, or proportion of occurrences, of each order. Again, we can estimate this from corpus frequencies. The relative frequency of "A and B" is the number of occurrences of "A and B" divided by the overall frequency of (A, B) binomial expressions. It is thus a real number between 0 and 1, inclusive. The relative frequency of "B and A" is one minus the relative frequency of "A and B".
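The two corpus-based variables defined above can be illustrated with a minimal Python sketch. The function names and the counts are hypothetical, chosen only for illustration, and returning `None` for an unattested pair is our assumption about how to represent the undefined case (a pair with zero overall frequency has no relative frequency).

```python
def overall_frequency(count_ab, count_ba):
    """Combined count of "A and B" and "B and A" for a word pair (A, B)."""
    return count_ab + count_ba


def relative_frequency(count_ab, count_ba):
    """Proportion of occurrences in the order "A and B".

    Returns None when the pair is unattested (overall frequency of zero),
    since the proportion is then undefined. This convention is an assumption
    of the sketch, not something specified in the text.
    """
    total = overall_frequency(count_ab, count_ba)
    if total == 0:
        return None
    return count_ab / total


# Hypothetical counts, for illustration only (not taken from any corpus).
count_ab, count_ba = 90, 10
print(overall_frequency(count_ab, count_ba))       # overall frequency of the pair
print(relative_frequency(count_ab, count_ba))      # relative frequency of "A and B"
print(1 - relative_frequency(count_ab, count_ba))  # relative frequency of "B and A"
```

Raw counts are used here for simplicity; converting to occurrences per million words would only rescale the overall frequency and would leave the relative frequency unchanged.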
The final variable we consider is the ordering preference due to people's abstract knowledge of binomial ordering constraints. For a given order "A and B", we want a value between 0 and 1 corresponding to the probability of someone producing that order based on their knowledge of the abstract constraints governing binomial ordering. Unlike the previous two variables, we cannot directly estimate people's abstract knowledge from corpus frequencies.
Instead, we will build a probabilistic model based on that of Benor and Levy (2006) to give us these estimates. In this paper, we make the simplifying assumption that abstract ordering preferences are fixed for a given expression; that is, they do not depend on the local context, linguistic or otherwise. This assumption would not always hold in a more naturalistic setting: in a separate corpus analysis (Morgan, 2016, chapter 3), we find that ordering preferences for 4% of tokens are directly influenced by the local linguistic context, e.g. because one element in the pair was previously mentioned. However, our experimental materials (described in Section 3) will as much as possible avoid local contexts that would influence expression order, so we consider this a reasonable simplification for the present work.
Of these variables, the two that directly compete to explain binomial ordering preferences in online processing are relative frequency and abstract knowledge. Crucially, although these two variables may be correlated, we assume that they are not equivalent, as relative frequency can be influenced by factors beyond abstract knowledge such as conventionalization and idiomaticity, famous quotations, or language change that interacts with abstract ordering constraints (e.g. changes in word meaning or pronunciation). For example, although abstract knowledge includes a strong constraint to put men before women, ladies and gentlemen is strongly preferred to gentlemen and ladies due to its conventionalized use in formal addresses. Discrepancies between abstract knowledge and relative frequency are not necessarily limited to such extreme cases as ladies and gentlemen but may exist in subtler ways for many expressions in the language.
We further note that the roles of relative frequency and abstract knowledge in determining ordering preferences may change depending on the overall frequency of an expression: in the most extreme case, a never-before-encountered binomial by definition cannot be influenced by its relative frequency in previous experience. Our goal is therefore to measure the relative contributions of abstract knowledge and relative frequency to binomial ordering preferences, and to determine whether and how these change as a function of overall frequency.

Dependent variables of interest
We consider two measures of people's processing of binomial expressions. First, we carry out a forced-choice preference experiment in which people see both possible orders of a binomial expression and choose which they prefer. For each expression, we can then calculate the proportion of people who prefer a given order. Next, we measure reading times for expressions in each order as an online measure of processing difficulty. We thus obtain two measures indexing the degree of human preference for one order over the other. We can then test which combination of our proposed independent variables (overall frequency, relative frequency, and abstract knowledge) best predicts the human data.

Predictions
Let us consider possible combinations of independent variables and what effects they might have on the behavioral data.
Abstract knowledge only. One possibility is that only abstract knowledge of ordering constraints influences processing. This would be the case if (a) there are no effects of direct experience with specific binomial orders (in line with a words-and-rules theory of language processing), and (b) there are online effects of ordering constraints. In this case, we predict that abstract knowledge but not relative frequency will have predictive power. More specifically, this theory predicts that abstract knowledge will be the best predictor of the behavioral data, and that its predictive power should not change as a function of relative or overall frequency.
Relative frequency only. If, as predicted by exemplar-based theories, there are effects of direct experience with specific binomial orders, then relative frequency should influence behavior for expressions that people have experience with, i.e. expressions with nonzero overall frequency. If, furthermore, abstract ordering constraints are not active in online processing, then only relative frequency should play a role. In this case we predict that novel binomial expressions will show no ordering preferences because people have no experience with them, but that relative frequency will be predictive of the behavioral data for all attested binomials. Under such a theory, relative frequency may improve as a predictor with increased overall frequency, but this would be due to having more robust estimates of relative frequency with increased overall frequency, not due to any change in the role of abstract knowledge.
Both abstract knowledge and relative frequency. If exemplar-based theories are correct that there are effects of direct experience, and moreover if abstract ordering constraints are active in online processing, then we predict that both relative frequency and abstract knowledge will be predictive of the behavioral data. For novel binomial expressions, with which people lack direct experience, abstract knowledge will be predictive. For attested expressions, some combination of abstract knowledge and relative frequency will be the best predictor (as predicted by Bod et al., 2003; O'Donnell et al., 2011; Post & Gildea, 2013).
To summarize, we investigate the roles of abstract knowledge and direct linguistic experience in the processing of both novel and frequently attested binomial expressions. We estimate people's direct experience with expressions in each possible order using corpus frequencies, and we estimate their abstract knowledge of ordering preferences using a probabilistic model. We evaluate which combination of these best predicts behavioral data in a forced-choice preference task and a self-paced reading task.
The organization of the remainder of this paper is as follows: In Section 2, we introduce the probabilistic model used to estimate abstract knowledge of binomial ordering preferences. In Section 3, we describe the experimental materials used in our behavioral experiments. In Sections 4 and 5, we discuss two experiments. Section 6 gives a general discussion.

Probabilistic model of ordering preferences
We begin by developing a probabilistic model of binomial ordering preferences. This model integrates the constraints on ordering that have been discussed in the previous literature (as summarized by Benor & Levy, 2006), allowing us to approximate a native English speaker's abstract knowledge of ordering preferences for a given binomial expression, independent of their direct experience with tokens of the expression.
We develop a logistic regression model following Benor and Levy. For a given word pair (A, B), this model predicts the probability that a binomial expression will be realized as A and B. We train our model on Benor and Levy's dataset, a random selection of binomial expressions drawn from a collection of corpora.⁴ As Benor and Levy note, conclusions drawn from token counts rather than type counts may be skewed by the presence of a small number of very frequently attested frozen expressions (e.g. back and forth, with a token count of 49). We thus train our model on binomial types rather than tokens. This necessitated excluding expressions that appeared in both orders (15 word pairs), leaving us with 379 binomial expression types.
Benor and Levy coded their dataset for twenty potential constraints on ordering based on a thorough review of the previous literature. A constraint is said to be active for a given word pair if it favors one order over another; not all constraints are active for all word pairs. When constraints are active, they are binary-valued, favoring either word A first or word B first. Specifically, constraints are coded as 1 when they favor alphabetic order, À1 when they favor non-alphabetic order, and 0 when they are inactive. Outcomes are coded as 1 if the binomial expression appears in alphabetical order and 0 otherwise.
Benor and Levy did not do any model selection to determine which of their constraints were good predictors, although their results show that some, particularly among the nonmetrical phonological constraints, are very poor predictors. For our model, we use a subset of their constraints. Our goal is to develop the best possible model of binomial expression preferences that is nonetheless reasonably parsimonious (in particular, does not include those constraints that are clearly poor predictors). It is not our goal to conclusively demonstrate that particular constraints are significant predictors of preferences: rather, our goal is to develop an effective predictive model that can be used to investigate the link between abstract knowledge of binomial ordering preferences and behavioral responses in offline and online processing tasks. We thus adopt relatively lenient criteria for inclusion of constraints in our final model. From Benor and Levy's twenty constraints, we begin by excluding two constraints that are rarely active in the dataset, and all expressions in which they are active: the Absolute Formal Markedness constraint (the two elements do not share a derivation, but one element is structurally simpler, i.e. contains fewer morphemes; active once) and the Pragmatic constraint (ordering is directly influenced by the local linguistic context; active thrice). With the remaining constraints, we fit a logistic regression model using the glm function in R (R Core Team, 2014). Each constraint was entered as a predictor, with no interactions between constraints. We performed backwards model selection, excluding constraints one at a time based on their Wald z statistic, until all remaining constraints had p < 0.15. 5,6 Our final model contains seven constraints. All affected the model's predicted ordering preference in the direction expected by Benor and Levy or by the sources who first proposed the constraint.
See Table 1 for details of the constraint weightings. The constraints included in our final model are (with examples of binomials that satisfy each constraint drawn from the training data):
Formal markedness. The word with more general meaning or broader distribution comes first. For example, in boards and two-by-fours, boards are a broader class of which two-by-fours is one member.
Perceptual markedness. Elements that are more closely connected to the speaker come first. This constraint encompasses Cooper and Ross's (1975) 'Me First' constraint and includes numerous subconstraints, e.g.: animates precede inanimates; concrete words precede abstract words. For example, in deer and trees, deer are animate while trees are inanimate.
Power. The more powerful or culturally prioritized word comes first. For example, in clergymen and parishioners, clergymen have higher rank within the church.
Iconic/scalar sequencing. Elements that exist in sequence should be ordered in sequence. For example, in achieved and maintained, a state must be achieved before it can then be maintained.
No final stress. The final syllable of the second word should not be stressed. For example, in abused and neglected, abused has final stress and should therefore not be in the second position.
Frequency. The more frequent word comes first, e.g. bride and groom.
Length. The shorter word (measured in syllables) comes first, e.g. abused and neglected.
The dataset on which we originally trained our model contained seven binomial expressions that were also included in Siyanova-Chanturia et al.'s (2011) items, which we later use as test items. Therefore, after doing model selection on the original dataset, we retrained our model, excluding these seven items from the training data. All results, beginning with Table 1, are reported based on the retrained model.
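As a minimal sketch of how a model of this kind turns constraint codes into a predicted probability, the Python below applies the logistic function to a weighted sum of constraint values coded 1, −1, or 0. The weights, the zero intercept, the constraint names used as dictionary keys, and the example coding are all hypothetical placeholders chosen for illustration; they are not the fitted values reported in Table 1.

```python
import math

# Hypothetical weights for illustration only; these are NOT the fitted Table 1 values.
HYPOTHETICAL_WEIGHTS = {
    "formal_markedness": 1.0,
    "perceptual_markedness": 1.0,
    "power": 1.0,
    "iconic_sequencing": 2.0,
    "no_final_stress": 0.5,
    "frequency": 0.5,
    "length": 0.5,
}


def prob_alphabetical(codes, weights=HYPOTHETICAL_WEIGHTS, intercept=0.0):
    """Return the predicted probability that a pair appears in alphabetical order.

    codes maps constraint names to 1 (favors alphabetical order),
    -1 (favors non-alphabetical order), or 0 (inactive).
    Constraints missing from codes are treated as inactive.
    """
    for name, value in codes.items():
        if value not in (-1, 0, 1):
            raise ValueError(f"{name} must be coded -1, 0, or 1")
    score = intercept + sum(weights[name] * value for name, value in codes.items())
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coding of one word pair: two constraints favor
# alphabetical order and one favors non-alphabetical order.
example = {"power": 1, "length": 1, "frequency": -1}
print(round(prob_alphabetical(example), 3))
```

With a zero intercept, a pair with no active constraints receives a probability of 0.5, and reversing every code mirrors the prediction around 0.5, which matches the treatment of alphabetical order as a neutral reference.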

Model validation
We validate the model by testing its predictions on the training data and on the 42 attested binomials used by Siyanova-Chanturia et al. (2011). Constraint values for the Siyanova-Chanturia et al. binomials were hand-coded as described in Section 3. The model correctly predicts the ordering preferences for 287/372 (77%) of the training data and 30/42 (71%) of Siyanova-Chanturia et al.'s items, both significantly greater than chance (50%) in a two-tailed binomial test (p < 0.001 and p < 0.01, respectively).
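The two validation figures can be checked with an exact binomial test against chance. The sketch below uses only the Python standard library and takes the two-tailed p-value to be the total probability of all outcomes no more likely than the observed one; this is one common definition and is assumed here, since the paper does not say which implementation was used.

```python
from math import comb


def binomial_two_tailed_p(successes, n, p=0.5):
    """Exact two-tailed binomial p-value.

    Sums the probabilities of every outcome that is no more likely
    than the observed number of successes under success rate p.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[successes]
    # Small relative tolerance so that ties are not lost to rounding error.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


p_training = binomial_two_tailed_p(287, 372)  # training items
p_held_out = binomial_two_tailed_p(30, 42)    # Siyanova-Chanturia et al. items
print(p_training < 0.001, p_held_out < 0.01)
```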

Experimental materials
Using our probabilistic model, we develop the linguistic stimuli used in both experiments. Our stimuli consisted of 84 word pairs, with each pair producing two possible binomial expressions (A and B or B and A). 42 of our items, taken directly from Siyanova-Chanturia et al. (2011), are frequently attested. They range from almost completely frozen (e.g. bread and butter) to relatively flexible (e.g. radio and television/television and radio).
We further created 42 novel items which our model predicts to have strong ordering preferences (e.g. bishops and seamstresses/seamstresses and bishops). To ensure that speakers have no prior experience with these expressions, we consult the nearly 500-billion-word Google books n-gram corpus (Lin et al., 2012). Our novel binomials are not included in this corpus in either order. 7 Our probabilistic model gives us an estimate of the direction and strength of ordering preference for each item based on abstract ordering constraints. To generate model predictions for these items, we must code them for the seven constraints described in Section 2. Final Stress and Length were coded by either the first author or a trained research assistant, both native speakers of American English. Frequency estimates were obtained from the HAL database via the English Lexicon Project (Balota et al., 2007). 8 Coding the remaining four constraints requires real-world knowledge, and so they were coded twice, independently, by the first author and a trained research assistant. Conflicting judgments were resolved through discussion; with discussion, the two coders were always able to reach agreement.
As predicted in Section 1.2.1, our attested items show a significant but not perfect correlation between model-predicted abstract ordering preference and relative frequencies (computed from the Google n-grams corpus; Brants & Franz, 2006): r(40) = 0.59, p < 0.0001. This relationship is visualized in Fig. 1.
For our novel binomials, we chose expressions that our model predicts to have strong ordering preferences, with values less than 0.3 or greater than 0.7. As much as possible, we chose expressions that minimized the correlations between constraints (e.g. to dissociate length and frequency). A comparison of the profiles of constraint activity for novel and attested items is given in Appendix B.
For all items, both novel and attested, we constructed a sentence context for the binomial expression, e.g.:
(3) There were many bishops and seamstresses in the small town where I grew up.
(4) There were many seamstresses and bishops in the small town where I grew up.
Sentence structure was unrestricted, but the binomial expression was never in the first two or the last four words of the sentence. Sentences were designed not to introduce pragmatic constraints on binomial ordering: in particular, neither binomial element (nor any word related exclusively or primarily to only one of the elements) was mentioned in the sentence before the binomial occurred.

5 We made one exception by keeping the Iconic Sequencing constraint in our model, although it had a high p value. This constraint was never violated in our dataset, and estimation of the Wald z statistic is unreliable in cases such as this with large estimated coefficients, due to inflated standard error estimates (Agresti, 2002; Menard, 2002). A likelihood ratio test supports our keeping this constraint in the model. (See Table 1.)
6 Backwards model selection is anti-conservative (Harrell, 2001), but this is not a problem in light of the desire for leniency discussed above, as we are not attempting to draw strong conclusions about which particular constraints influence preferences. In terms of the effects on later results, including irrelevant constraints in our predictive model would add noise to our abstract ordering preference estimates, making it harder to detect effects of abstract knowledge. As we do ultimately find effects of abstract knowledge, any noise introduced at this stage was apparently not substantial enough to counteract these findings.
7 Levy, Fedorenko, Breen, and Gibson (2012) estimate that college-age English speakers have been exposed to no more than 350 million words of English in their lifetimes. To be included in the Google books corpus, an n-gram must have appeared at least 40 times in their 468,491,999,592 word corpus. Thus our binomials can have appeared at most 39 times in this corpus, and there is at most a roughly 3% chance that a college-age speaker would have heard any given one of these expressions. Although our participants are on average slightly older than college-age, we believe there is still an exceedingly small chance that they will have substantial experience with any of these expressions.
8 On three occasions, one word in a pair was not in the English Lexicon Project database (groundskeeper, ninety-eighth, and wildfires). In these cases, the non-included word was assumed to be the less frequent.
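The exposure bound in footnote 7 can be reproduced with a short calculation. The sketch below assumes, for illustration, that a speaker's lifetime input contains an expression at the same rate as the corpus does; the variable names are ours, and the expected number of encounters is used as an upper bound on the probability of at least one encounter.

```python
# Figures taken from footnote 7.
LIFETIME_WORDS = 350_000_000    # upper estimate of a college-age speaker's exposure
CORPUS_WORDS = 468_491_999_592  # size of the Google books corpus
MAX_CORPUS_COUNT = 39           # one below the inclusion threshold of 40

# Assumption: lifetime input samples the expression at the corpus rate.
max_rate_per_word = MAX_CORPUS_COUNT / CORPUS_WORDS
expected_encounters = max_rate_per_word * LIFETIME_WORDS

# The expected count bounds the chance of hearing the expression at least once.
print(f"{expected_encounters:.3f}")
```

The result is just under 0.03, in line with the "roughly 3%" figure given in the footnote.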
With these materials, we carried out two behavioral experiments, a forced-choice preference experiment and a self-paced reading experiment.

Participants were recruited through Amazon Mechanical Turk, restricted to people connecting to the website from within the United States, and were paid 50 cents. Participants were asked to report their "Native language (what you learned to speak with your mother as a child)". Those who did not report English among their native languages were excluded.

Procedure
The Amazon Mechanical Turk instructions directed participants to an external website, where our experiment was presented using WebExp (Keller, Gunasekharan, Mayo, & Corley, 2009). Participants first filled out a demographic questionnaire, then continued to the main experiment. On each trial, participants saw one item embedded in sentence context, in both possible orders, e.g.:
There were many bishops and seamstresses in the small town where I grew up.
There were many seamstresses and bishops in the small town where I grew up.
Participants were asked to choose which order "sounds more natural". Each participant saw all 84 items. Which expression order was listed first was counterbalanced across participants. Order of item presentation was randomized separately for each participant. The experiment typically took 10-15 min.

Results
Before proceeding with our main multiple regression analysis of the effects of abstract knowledge and direct experience on ordering preference, we present a striking overall difference between the distributions of preference strengths for attested versus novel binomials. Fig. 2 shows that ordering preferences are more polarized for attested than for novel binomials (despite the fact that we selected our novel binomials to have extreme preferences); in other words, preferences are more consistent across subjects for the attested expressions. We define a measure of extremity for each item as the difference between its experimentally determined preference strength (i.e. proportion of times preferred in alphabetical order) and 0.5. In a t-test, the attested items are significantly more extreme than the novel (t = 8.31, p < 0.001). We discuss this issue further in Sections 4.3 and 6.3.

Multiple regression analysis
Table 1. Constraint weights in our probabilistic model. In addition to reporting the Wald z statistic and p-values based on it (columns 3-4), we report results of a likelihood-ratio test comparing versions of the model differing only in whether they include the constraint in question (and containing all other constraints; columns 5-6). [Table body not recovered; first column headers: Constraint, Regression coeff.]

Next we analyze our data using mixed-effects logistic regression (Jaeger, 2008). Our dependent variable is the preferred order, coded as alphabetical or non-alphabetical: alphabetical order is used as a neutral order because results of our initial model selection (see Section 2) indicate that alphabetical order is not a significant predictor of ordering preference. Our independent (fixed-effect) predictors are:
Type (attested/novel) is treatment coded with "attested" as the reference level, i.e. the Intercept value applies to attested items, and this value is adjusted by the Type:novel value for novel binomials. We predict no significant intercept (i.e. attested binomials are not significantly more likely to be preferred in alphabetical or non-alphabetical order, absent other factors), and no significant effect of type (i.e. novel binomials are not significantly more or less likely to be preferred in alphabetical order than attested binomials).
Abstract knowledge is operationalized as our model's predicted probability (between 0 and 1) of the expression occurring in alphabetical order. We center this predictor around 0.5. We nest the abstract knowledge predictor within type, i.e. we fit separate parameters for the effect of abstract knowledge for novel and attested binomials, allowing us to consider the effects of abstract knowledge on each type independently. For each type, if abstract ordering constraints are active in influencing offline judgments, then we predict a significant effect of abstract knowledge.
Relative frequency estimates are computed for attested binomials using the Google n-grams corpus (Brants & Franz, 2006) as the frequency of "A and B" divided by the frequency of "A and B" plus "B and A" (resulting in a value between 0 and 1), and centered around 0.5. Relative frequency for all novel binomials is set to 0 after centering. (Thus no interaction of relative frequency with type is necessary.) If direct experience with attested expressions influences offline judgments, then we predict a significant effect of relative frequency.
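The relative frequency predictor can be written out as a small function. This is a sketch under our own naming; the counts in the usage lines are made up for illustration and are not taken from the Google n-grams corpus. Treating a pair with no attestations in either order as novel is also an assumption of this sketch.

```python
def centered_relative_frequency(count_a_and_b, count_b_and_a):
    """Relative frequency of the 'A and B' order, centered around 0.5.

    Pairs with no attestations in either order are treated as novel
    and receive 0, i.e. the value assigned after centering.
    """
    total = count_a_and_b + count_b_and_a
    if total == 0:
        return 0.0
    return count_a_and_b / total - 0.5


def centered_abstract_knowledge(model_probability):
    """Center the model's predicted probability of alphabetical order."""
    return model_probability - 0.5


# Hypothetical counts: a nearly frozen pair, a flexible pair, a novel pair.
print(centered_relative_frequency(990, 10))
print(centered_relative_frequency(55, 45))
print(centered_relative_frequency(0, 0))
```

Both predictors then share a scale running from −0.5 to 0.5, with 0 meaning no preference for either order.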
Following Barr, Levy, Scheepers, and Tily (2013), we use the maximal random effects structures for subjects and items justified by the experimental design: by-subject and by-item intercepts, and by-subject slopes for type, abstract knowledge, their interaction, and frequency. Model results are given in Table 2. 9 Significance levels for effects are reported using the Wald z statistic and are confirmed using likelihood ratio tests. We see a significant effect of abstract knowledge for both novel and attested expressions, demonstrating that abstract ordering constraints are active in determining forced-choice preferences for both binomial types. In a likelihood ratio test comparing this model to a model with only an additive (non-nested) fixed effect of abstract knowledge, we find no significant difference (χ²(1) = 1.63, p = 0.20); in other words, the effect of abstract knowledge does not differ significantly between novel and attested expressions. The effect of abstract knowledge for novel binomials is displayed in Fig. 3.
We also see a significant effect of relative frequency, demonstrating that direct experience also plays a role in determining preferences for attested expressions. We note that relative frequency is a stronger predictor than abstract knowledge, measured in terms of larger regression coefficient estimate, larger z value, and larger change in likelihood when removed from the model. The strong predictive power of relative frequency is displayed in Fig. 4.

Discussion
In this experiment, we set out to test whether abstract knowledge and direct experience (specifically, relative frequency) predict ordering preferences in a forced-choice preference task for both novel and frequently attested binomial expressions. We demonstrate that preferences for both attested and novel expressions are affected by abstract knowledge and that preferences for attested expressions are also strongly predicted by relative frequency. This pattern of results supports a theory wherein both abstract knowledge and direct experience play a role in processing. Moreover, for attested expressions, we find that relative frequency is a stronger predictor of preferences than abstract knowledge, suggesting that processing of these expressions relies more heavily upon direct experience than upon abstract knowledge.
Although the effect of abstract knowledge does not differ significantly across binomial types, we do not think it is justified to draw strong theoretical conclusions from this null result. As we will see in Section 5.2.2, abstract knowledge does interact significantly with binomial type in Experiment 2. We defer further discussion of this issue until Section 5.3.
We additionally find that forced-choice preferences are more extreme for attested than for novel expressions; that is, attested expressions are more consistently preferred in one direction than novel expressions. Taken at face value, this finding suggests that increased overall frequency of an expression exaggerates or solidifies people's preferences. Another possibility, however, is that preferences for novel expressions are underlyingly as extreme as those for the attested expressions, but that the forced-choice judgment process for these items is noisier, 10 making the resulting preferences for novel expressions appear less extreme than they truly are. We will return to this question in the general discussion.
One potential confound mentioned earlier is the role of local sentence context on binomial order preferences. Although we tried to avoid biasing contexts in designing our materials, it is always possible that some bias unintentionally slipped through. However, even if such bias does exist within individual sentences (i.e. the sentence context favors one order more than another, relative to the binomials' intrinsic ordering preference in a hypothetical neutral context), it would not confound the results presented here. Specifically, because our dependent variable is an alphabetical versus non-alphabetical preference, in order to bias our results the local context biases would need to be systematically correlated with the alphabetical/non-alphabetical preferences as given by our predictors of interest (abstract knowledge and relative frequency). Since we have no reason to expect this to be the case, any unintentional effects of local context will merely add noise to our estimates of ordering preferences. In the next experiment, we ask whether the patterns found in our forced-choice preference experiment likewise hold in an online reading experiment.

[Fig. 2, caption fragment] In judgments, attested binomials have more extreme preferences (i.e. more consistent across subjects) than novel binomials, demonstrating a qualitatively similar distribution to corpus frequencies.

9 The model presented here includes all the fixed-effect predictors and interactions that are of crucial theoretical interest for the hypotheses we set out to test. In order to explore possible further interactions between predictors, as well as possible changes in behavior over the course of the experiment, we fit a mixed-effects logistic regression including as predictors all the previous predictors, a trial order predictor, and all two-way interactions, using the MCMCglmm package in R (Hadfield, 2010). (The trial order predictor was not included in the original model presented here because a main effect of trial order is implausible, as it would indicate a changing probability of preferring binomials in alphabetical order over the course of the experiment. However, its interaction with other predictors, in particular abstract knowledge and relative frequency, is potentially of interest.) No further interactions (beyond the type × abstract knowledge interaction included in the original model) reached significance.
10 There are many reasons why this could be the case. For instance, when judging attested items, participants may believe that there is a "right" answer and take care to give that answer, whereas when judging novel items, they may put in less effort.
5. Experiment 2: Self-paced reading
5.1. Method
5.1.1. Participants
400 native English speakers (mean age = 34 years; sd = 12) participated. Experiment 2 required substantially more participants than Experiment 1 because the self-paced reading data are noisier than the forced-choice data and because, as described in Section 5.1.2, each subject saw approximately half the items in Experiment 2, compared to all the items in Experiment 1. Participant recruitment was identical to Experiment 1, except that participants were paid $1.00.

Procedure
The experiment was presented within Amazon Mechanical Turk using flexspr (Tily, 2012; previously used by Bergen, Levy, & Gibson, 2012; Linzen & Jaeger, 2015; Singh, Fedorenko, Mahowald, & Gibson, 2015). Using this method online allows for collection of more data than would be possible in a laboratory setting, and previous work has replicated multiple in-the-lab results with web-based self-paced reading (Enochson & Culbertson, 2015). Participants first filled out a demographic questionnaire, then read sentences in a self-paced reading paradigm: sentences were presented one word at a time, and participants pressed a button to advance to the next word. Reading times for each word were recorded. Participants read three practice sentences, then continued to the main experiment.
Our materials consisted of the same 84 binomial expressions in sentence context as used in Experiment 1, plus 84 unrelated filler sentences. Two stimulus lists were constructed with items rotated and counterbalanced between lists so that each participant only saw a given binomial in one of its two possible orders. Due to a programming error, out of the 168 items in each list, each participant saw a random selection of 80 items. Order of presentation was randomized separately for each participant.
Presentation of each sentence was followed by a yes/no comprehension question. Answers did not depend on the order of the binomial expression. The experiment typically took about 30 min.

Comprehension question accuracy
Comprehension question accuracy is extremely high across all conditions. See Table 3.

Multiple regression analysis
We use regression analysis to compare abstract knowledge and relative frequency as predictors of reading times, analogous to our analysis in Experiment 1.
[Fig. 3, caption fragment] x values are the abstract knowledge model's prediction for how often the item will appear in alphabetical order. y values are how often the item was preferred in that order. Line shows best linear fit on the by-items aggregated data. Abstract knowledge is a significant predictor of preferences for novel expressions.

We divide our experimental items into regions of analysis as shown in Table 4. The Prelim region encompasses the entire beginning of the sentence up to the binomial expression; all further regions are a single word. We analyze reading time data for each trial summed over a six-word region spanning from Word1 through Spill3. By summing across reading times for these regions, we take advantage of the controlled properties of our stimuli: regardless of order of binomial presentation across conditions, participants will have read the same group of words within the region being analyzed. (For more direct comparison with the previous literature, we present word-by-word analyses of reading times in Appendix C.) Specifically, we computed a summed reading time measure for each trial as follows: we excluded all trials in which the reading time for any word was less than 100 ms or greater than 5000 ms. To account for influences of word length, as described by Ferreira and Clifton (1986), we then computed subject-specific residualized reading times (regressed against word length) for each word from the Word1 through Spill3 regions, using data from all non-sentence-final words in non-practice trials. 11 Summing the residuals for this six-word region gives us a residual reading time for each trial. We performed outlier removal without regard to item type or condition: we computed a grand mean and standard deviation and excluded trials with summed times more than 2.5 standard deviations above or below the mean, resulting in a loss of 1.7% of data. We analyze the data using a mixed-effects linear regression similar to that used in Experiment 1.
Our dependent variable is summed residual reading time. Our independent (fixed-effect) predictors and their interpretations are identical to those used in Experiment 1 (Section 4.2.1) with one addition: Trial order is the position in the experiment in which the given trial occurred. As is common in reading experiments (e.g. Fine, Jaeger, Farmer, & Qian, 2013; Hofmeister, Jaeger, Arnon, Sag, & Snider, 2013, and many others), we expect that subjects will read faster later in the experiment due to practice effects.
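The preprocessing that yields the summed residual reading time can be sketched in a few steps. The Python below is an illustration under stated assumptions, not the authors' analysis code: the trial layout (a subject label, a list of (word length, reading time) pairs whose last entry is the sentence-final word, and a region slice) is hypothetical, the word-length regression is a simple per-subject least-squares line, and the regression is fit after the 100-5000 ms exclusion.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope


def summed_residual_rts(trials, lo=100, hi=5000, sd_cutoff=2.5):
    """Return one summed residual reading time per retained trial."""
    # Step 1: drop trials containing any word read too fast or too slowly.
    kept = [t for t in trials if all(lo <= rt <= hi for _, rt in t["words"])]

    # Step 2: per subject, regress reading time on word length,
    # using every non-sentence-final word from that subject's trials.
    by_subject = {}
    for t in kept:
        by_subject.setdefault(t["subject"], []).extend(t["words"][:-1])
    fits = {
        subj: fit_line([n for n, _ in ws], [rt for _, rt in ws])
        for subj, ws in by_subject.items()
    }

    # Step 3: sum the residuals over the critical region of each trial.
    sums = []
    for t in kept:
        intercept, slope = fits[t["subject"]]
        start, end = t["region"]
        region = t["words"][start:end]
        sums.append(sum(rt - (intercept + slope * n) for n, rt in region))

    # Step 4: remove trials beyond the cutoff from the grand mean.
    if len(sums) < 2:
        return sums
    grand_mean, sd = mean(sums), stdev(sums)
    return [s for s in sums if abs(s - grand_mean) <= sd_cutoff * sd]
```

In this layout, a region of `(start, end)` covering six positions would correspond to Word1 through Spill3; summing within it gives the same set of words for both orders of a binomial.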
In addition to our hypotheses regarding possible influences of abstract knowledge and direct experience on reading times (which are the same as in Experiment 1), we additionally anticipate a possible statistically significant but theoretically uninteresting main effect of binomial type because the two types contain different words in different sentence frames, and thus one type may be read faster than the other. Following Barr et al. (2013), we use the maximal random effects structure for subjects as justified by the experimental design, namely an intercept and slopes for type, abstract knowledge, their interaction, and relative frequency. We also include a by-subjects random slope for trial order. For items, defined as unordered word pairs, we include a random intercept, a random slope for trial order, and (in place of random slopes for both abstract knowledge and relative frequency) a random slope for a binary alphabetical/non-alphabetical factor, thus allowing for arbitrary item-specific ordering preferences.
Model results are given in Table 5. 12 Effects with t > 2 are taken to be significant. Positive coefficients indicate slower reading. We see a significant main effect of type with novel expressions read slower, which we attribute to these expressions containing less frequent words on average, in addition to being less frequent expressions overall.
[Fig. 4, caption fragment] x values are the abstract knowledge model's prediction for how often the item will appear in alphabetical order. y values are the item's relative frequency of appearing in that order. Points' shading (white to black) shows how often the item was preferred in that order. Background shading (light to dark orange) shows the best-fit model (Table 2) prediction for how often the item was preferred in that order. Both relative frequency and abstract knowledge predict true preferences, as depicted by the diagonal background gradient, but relative frequency is the stronger predictor, as depicted by the stronger vertical than horizontal gradient. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

We do not find a significant effect of abstract knowledge for attested expressions, suggesting that abstract ordering constraints are not active in the online processing of these expressions. However, we do find a significant effect of abstract knowledge for novel expressions. In a likelihood ratio test comparing this model to a model with only an additive (non-nested) effect of abstract knowledge, we find a significant difference (χ²(1) = 4.24, p < 0.04); in other words, the effect of abstract knowledge differs significantly between novel and attested expressions, playing a significant role in online processing for novel expressions only. We additionally find a significant effect of relative frequency, demonstrating that higher relative frequency leads to faster reading in the online processing of attested expressions. Finally, we find a significant effect of trial order, with faster reading later in the experiment. Results are visualized in Figs. 5 and 6.
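The p-values attached to the two likelihood ratio tests (statistics of 1.63 in Experiment 1 and 4.24 in Experiment 2, each with one degree of freedom) can be recovered from the chi-square distribution. With one degree of freedom the upper tail has a closed form in terms of the complementary error function, so the sketch below needs only the standard library; the function name is ours.

```python
from math import erfc, sqrt


def chi2_df1_p(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    A chi-square variable with 1 df is a squared standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(statistic / 2.0))


print(round(chi2_df1_p(1.63), 2))  # Experiment 1 comparison
print(round(chi2_df1_p(4.24), 3))  # Experiment 2 comparison
```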

Discussion
We demonstrate for the first time that novel binomial expressions show online effects of abstract ordering preferences. In contrast, reading times for frequently attested binomial expressions are only influenced by relative frequency. These findings imply a trade-off in online processing between reliance on abstract knowledge and direct experience, where novel expressions must be processed on the basis of abstract knowledge only, but highly frequent attested expressions can be processed primarily with reference to previous direct experience.
Here we found a significant interaction of abstract knowledge with binomial type, such that abstract knowledge was significantly less active in determining reading times for attested binomials than for novel binomials. In contrast, in Experiment 1, we found no such significant interaction. What is consistent across these two experiments is that processing of attested expressions is more strongly influenced by direct experience than by abstract knowledge. However, given the inconsistent results concerning the interaction of abstract knowledge and binomial type, we cannot state with confidence whether abstract knowledge is differentially active between novel and attested binomials.

General discussion
We set out to investigate the roles of abstract knowledge and direct experience in the processing of binomial expressions, asking whether binomial ordering preferences are driven by constraints on the semantic, phonological, and lexical properties of words in an expression, or by prior experience with the specific expression in question. Our key findings are as follows. First, we demonstrated that abstract ordering constraints are active in the comprehension of novel expressions in both an offline forced-choice task and an online self-paced reading task. Second, we demonstrated that for frequently attested expressions, effects of direct experience largely overwhelm abstract knowledge in predicting behavioral data, both in the offline task and especially in the online task.
Our results support exemplar- or usage-based theories of language, which allow for the storage and reuse of multi-word expressions. Specifically, our finding that ordering preferences for attested binomial expressions are primarily driven by relative frequency is evidence that the processing of these expressions makes use of holistic multi-word mental representations. In contrast, a traditional words-and-rules theory would predict that these expressions are generated compositionally each time they are encountered, and that the ordering preferences of attested expressions, like those of novel expressions, should stem from abstract ordering constraints rather than relative frequency of direct experience.
Of the predictions made in Section 1.2.3, our results indicate that both abstract knowledge and relative frequency play a role in the processing of binomial expressions. Many patterns are possible for the manner in which these two knowledge sources trade off as a function of the overall frequency of an expression: In one extreme, abstract knowledge could apply only for expressions that have never before been encountered, with relative frequency taking over as soon as any direct experience exists. In the other extreme, abstract knowledge could apply in the vast majority of cases, with relative frequency limited to playing a role only for the highest frequency items, such as those used in our experiments. A middle ground position proposes a gradual switch from reliance on abstract knowledge to reliance on relative frequency as overall frequency increases.
We propose that both extremes are unlikely and that the middle position of a gradual trade-off is the most likely. The first extreme is counterintuitive, since a single encounter with an expression seems insufficient to thoroughly trump abstract knowledge. The second extreme has been argued against by Arnon and Snider (2010), who found frequency effects for multi-word expressions across a wide range of frequencies. Their finding of frequency effects for low-to-medium frequency items would not be predicted by a theory in which direct experience applies only to the processing of extremely high frequency items. The gradual trade-off theory, on the other hand, is supported by a wide variety of computational models.

Figure (x axis: abstract knowledge model-predicted proportion; y axis: residual RT differential, ms). x values are the abstract knowledge model's predictions for how often the item will appear in alphabetical order. y values are the differences between average summed residual reading times for the non-alphabetical and alphabetical orders. Line shows best linear fit on the by-items aggregated data. Abstract knowledge is a significant predictor of reading times.
6.1. Convergent evidence from computational models

6.1.1. Connectionist models

A similar trade-off has been demonstrated in connectionist models of language learning in domains such as past-tense formation (Rumelhart & McClelland, 1986) and grammatical structure (Elman, 2003), which learn both generalized patterns and specific exceptions. These models learn to predict patterns within their training data (e.g. form the past tense by adding -ed). When new items are introduced, they are at first treated according to the general patterns, but with further training, the model can learn to treat certain items as exceptions.
These models have primarily been conceived as models of early language acquisition and tested on frequent items (e.g. common verbs), where it can be assumed that by adulthood, most native speakers will have extensive experience with all the items in question, and will thus consistently recognize certain words as exceptions to the general rules. However, their behavior on new items straightforwardly generalizes to low frequency items that even adult native speakers would have relatively little direct experience with, such as attested but low frequency binomial expressions. These models thus predict that such items could occupy a middle ground of partial reliance upon both general patterns (i.e. abstract knowledge) and direct experience, even in a fully developed adult grammar.

Exemplar-based computational models
A gradual trade-off is also predicted by a particular class of exemplar-based computational models of language: namely, those that incorporate representations of fragments varying in size, and that allow not only for holistic reuse of the largest fragments but also for rule-based composition of smaller fragments (e.g. Bod, 1998, 2008; Bod et al., 2003; Demberg, 2010; Johnson et al., 2007; O'Donnell et al., 2011). Within these models, multi-word expressions can thus be parsed both through direct reuse and through compositional generation. The probabilities assigned to these units (the holistic expressions, the individual words, and the compositional rules) will collectively determine the relative likelihoods of reuse versus regeneration. For more frequent expressions, the probability of reusing a holistic unit will be higher, while for less frequent expressions, the probability of compositional generation will be higher. These probabilities change gradually depending on the frequency of a given expression as well as the frequencies of similar expressions. These models thus also predict a gradual trade-off between reliance on abstract knowledge for infrequent items and reliance upon direct experience for frequent items.

Nonparametric Bayesian models
The gradual trade-off theory is also supported by a nonparametric Bayesian perspective (e.g. Goldwater, Griffiths, & Johnson, 2009; Xu & Tenenbaum, 2007), in which expectations are influenced by both a prior probability and the incoming data. In a Bayesian model, when little data has been seen, expectations are driven by the prior probability. As more data is seen, the data becomes increasingly influential, asymptotically approaching complete dominance. For binomial expressions, abstract knowledge can be thought of as a prior probability for ordering preferences, absent any direct experience with a given expression, and each direct encounter with an expression constitutes further data. Under the Bayesian perspective, when one has little experience with an expression, expectations will be governed by abstract knowledge, but with increasing experience, the relative frequency of ordering within the experienced data will be increasingly dominant in determining expectations.

Fig. 6. Results of Experiment 2 (attested items), visualized as colors overlaid in Fig. 1. Each point represents an item. x values are the abstract knowledge model's prediction for how often the item will appear in alphabetical order. y values are the item's relative frequency of appearing in that order. Points' shading (white to black) shows the item's true average RT differential. Background shading (light to dark orange) shows the best-fit model (Table 5) prediction for RT differential. Only relative frequency is a significant predictor of reading times, as depicted by the strong vertical background gradient. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
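The Bayesian trade-off described in this subsection can be made concrete with a minimal sketch. It assumes, purely for illustration, a Beta prior over the probability of the alphabetical order, centered on the abstract knowledge preference; the function name, the prior strength of 10 pseudo-observations, and the counts are hypothetical and are not taken from the models cited above.

```python
def expected_preference(prior_pref, prior_strength, n_alphabetical, n_total):
    """Posterior mean probability of the alphabetical order.

    Assumes a Beta prior with mean `prior_pref` (standing in for abstract
    knowledge) worth `prior_strength` pseudo-observations, updated with
    `n_alphabetical` alphabetical-order encounters out of `n_total`.
    """
    a = prior_pref * prior_strength + n_alphabetical
    b = (1.0 - prior_pref) * prior_strength + (n_total - n_alphabetical)
    return a / (a + b)


# Hypothetical expression: abstract knowledge favors the alphabetical order
# (0.7), but experience shows it in only 10% of encounters.
for n_alpha, n_total in [(0, 0), (1, 10), (10, 100), (1000, 10000)]:
    estimate = expected_preference(0.7, 10.0, n_alpha, n_total)
    print(n_total, round(estimate, 3))
```

With no encounters the estimate equals the prior preference, and as the number of encounters grows it moves gradually toward the experienced relative frequency, which is the qualitative pattern the gradual trade-off theory predicts.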

Advantages of our approach
While numerous models support our conclusions, the experiments presented here crucially advance the state of our understanding beyond what was previously known by providing a novel approach for using behavioral evidence, in conjunction with modern corpora and statistical techniques, to quantify the contributions of abstract knowledge and direct experience. Our probabilistic model provides quantitative estimates for the effects of abstract knowledge, while corpus frequencies provide estimates for direct experience. Using multiple regression modeling, we can directly compare the predictive strength of these two influences on behavioral data such as the results of our forced-choice and self-paced reading tasks. This approach allows us to move beyond the previous modeling-based approaches, which focused on predicting corpus data or language-wide trends. We can now investigate the trade-off between abstract knowledge and direct experience using behavioral evidence.
Additionally, the statistical techniques employed here allow us to make quantitative claims about the strength of reliance on both abstract knowledge and direct experience. We have seen this in a limited way so far, as we demonstrated that processing of frequently attested binomials is driven primarily by relative frequency, and only to a lesser degree by abstract knowledge. We have also predicted that there should be a gradual shift from reliance upon abstract knowledge to reliance upon relative frequency estimates as overall frequency increases; however, we cannot conclude this directly from our current data because overall frequency has only been explored as a dichotomous variable: either entirely novel or very frequent. In future work, we plan to look at an in-between zone of attested but not highly frequent expressions, e.g. sunglasses and sunscreen/sunscreen and sunglasses (1/1000th the frequency of the average attested expression in the current study). We predict that these expressions should show noticeable effects of both abstract constraints and relative frequency. Moreover, looking over a range of overall frequencies, we predict that we will see a quantitative trade-off between reliance on abstract knowledge and reliance on direct experience.
This approach to studying the trade-off between abstract knowledge and direct experience generalizes beyond the study of binomial expression ordering preferences. The cornerstone of this approach is that we are able to independently quantify the contributions of direct experience with specific expressions and abstract knowledge in the absence of direct experience. We propose that a combination of corpus frequencies and probabilistic modeling can provide such estimates for a wide range of linguistic constructions (e.g. the dative alternation [Bresnan, Cueni, Nikitina, & Baayen, 2007] and adjective ordering [Dixon, 1982; Truswell, 2009]), allowing us to ask broad questions about the trade-off between compositional generation and the reuse of stored expressions in linguistic processing. For example, to what extent are adjective ordering preferences due to abstract rules (e.g. shape before color) versus known collocations of highly frequent adjective sequences? The methods we have developed here make these questions accessible for future research.

Further predictions about language structure
Our results additionally lead to predictions about language structure. Our gradual trade-off theory predicts that items with higher overall frequency will be more likely to have relative frequency preferences that contradict abstract knowledge preferences. This prediction is analogous to the finding that more frequent verbs are more likely to be irregular (Bybee, 1985; Lieberman, Michel, Jackson, Tang, & Nowak, 2007): in the case of high overall-frequency items, people have enough exposure to learn idiosyncratic or abstract-knowledge-violating preferences, but in the case of low overall-frequency items, people have insufficient exposure to overcome their abstract knowledge. A further prediction follows from the results of Experiment 1, in which we found that preferences for attested items were more extreme, or polarized, than preferences for novel items. Assuming that preferences for attested items are driven primarily by relative frequency, this result predicts that as overall frequency increases, relative frequencies will become more polarized. In related work, we found that these predictions were borne out in a corpus analysis (Morgan & Levy, 2015), which demonstrated that binomial expressions with higher overall frequency have relative frequencies that deviate more from abstract knowledge, in particular by being more polarized. This finding in turn leads to further questions about the historical trajectories of binomial expression ordering preferences, and the dual roles of individuals' language processing and cultural transmission in shaping language structure (Kirby, Dowman, & Griffiths, 2007; Morgan & Levy, 2016). Thus the results presented here additionally open the door to further investigation of the mutually constraining processes of synchronic language processing and diachronic language change.
6. The elephants at the zoo were beautiful and stinky so the children loved them. Were there elephants at the zoo?
7. The engineer specialized in making bicycles and robots when he worked for the company. Did the engineer specialize in destroying things?
8. There were many bishops and seamstresses in the small town where I grew up. Did I grow up in a small town?
9. The berries were bitter and purple when I ate them this morning. Did I eat berries this morning?
10. Seth told me that there are blankets and kittens in that box over there. Were there blankets in the box?
11. The rangers seemed to act like campfires and wildfires were the same thing. Did I hear about fires from a policeman?
12. At the wizard school, chanting and enchanting were very common occurrences. Did the wizards ride broomsticks frequently?
13. When I met many chauffeurs and stewardesses at a party, I started questioning my job. Did I go to a party?
14. The third grade class saw cherries and llamas at the state fair. Did the class go to the state fair?
15. There was nothing but chickens and fences in the field behind the house. Was the field behind the house?
16. His uncles were all coroners and senators in their day jobs, but they all wanted to get into the movie industry.

B.7 shows the proportion of items for which each constraint is active (recalling that each constraint can be active or inactive for a given expression). As we can see, constraints are active approximately equally often in each group. Tables B.5 and B.6 show correlations between constraints: constraint activity is coded as 1 if it predicts that an expression should occur in alphabetical order and -1 if it predicts that an expression should occur in nonalphabetical order, or 0 for inactive constraints. We see that, for both novel and attested expressions, most constraints are not highly correlated. One noteworthy exception is Length and Final Stress, which are highly correlated because single-syllable words are as short as possible (hence should come first according to Length) and necessarily have final stress (hence should come first according to Final Stress).
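The constraint coding described above (1 for alphabetical, -1 for nonalphabetical, 0 for inactive) can be illustrated with a small sketch that computes the proportion of items for which a constraint is active and the correlation between two constraints. The five-item code vectors are invented for illustration and are not the values underlying Tables B.5 and B.6; the Frequency vector and the helper names are likewise hypothetical.

```python
from math import sqrt


def proportion_active(codes):
    """Share of expressions for which a constraint is active (code != 0)."""
    return sum(code != 0 for code in codes) / len(codes)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of constraint codes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical codes for five expressions:
#  1 = constraint favors alphabetical order, -1 = favors nonalphabetical, 0 = inactive.
length = [1, 1, -1, 0, -1]
final_stress = [1, 1, -1, 0, 0]
frequency = [1, -1, 0, 1, -1]

print(proportion_active(length))
print(round(pearson(length, final_stress), 2))  # constraints that tend to agree
print(round(pearson(length, frequency), 2))     # constraints that mostly do not
```

In this toy data Length and Final Stress usually point in the same direction, so their correlation is high, mirroring the exception noted above.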

Appendix C. Experiment 2 region-by-region analyses
Here we present region-by-region analyses of the self-paced reading data from Experiment 2. Our goals in these analyses are to replicate the results of Siyanova-Chanturia et al. (2011) that attested binomial expressions are read faster in their preferred order, and to demonstrate that this finding extends to novel expressions when categorized into preferred/dispreferred orders on the basis of abstract knowledge. Specifically, we analyze reading times by dichotomizing binomials into preferred/dispreferred conditions, rather than using continuous abstract knowledge and relative frequency predictors as in Section 5.2.2. For simplicity of presentation, and because we are not concerned here with comparisons across binomial types, we analyze each type (attested/novel) separately.
Residualization on word length and outlier removal are identical to those reported in Section 5.2.2, except that outlier removal was done for each region and each binomial type separately (because each region within each type is analyzed separately in this section).
For each binomial type and region, we fit a linear mixed-effects regression model with residualized reading times (in milliseconds) as the dependent variable. Our independent predictor of interest is a dichotomous preferred/non-preferred variable (treatment coded with "preferred" as the reference level). Details of how preferred order is assessed vary between binomial types and are discussed in more detail below. Trial order is also included as a predictor. Following Barr et al. (2013), we use the maximal random effects structure for subjects justified by the experimental design, namely an intercept and a slope for preferred/non-preferred order. We also include a random by-subjects slope for trial order. For items, defined as unordered word pairs, we use an intercept and a slope for a binary alphabetical/non-alphabetical factor (comparable to that used in Section 5.2.2). Results for the predictor of interest are shown in Table C.7.
Novel expressions. For novel expressions, we assign each expression a preferred and non-preferred order on the basis of our abstract knowledge model's prediction for ordering preferences. Results are shown in Fig. C.8. As seen in Table C.7, we find significant effects of order at the Word1 and Word2 regions, with preferred read faster than non-preferred.
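The dichotomization step for novel expressions can be sketched as follows. The sketch assumes that the abstract knowledge model outputs a probability that the expression appears in alphabetical order and that 0.5 is the cutoff; the cutoff, the function names, and the example probabilities are illustrative assumptions rather than details stated in the text.

```python
def preferred_order(word_a, word_b, p_alphabetical):
    """Return the (first, second) order treated as preferred.

    `p_alphabetical` is a hypothetical model-predicted probability that the
    pair appears in alphabetical order; 0.5 is an assumed cutoff.
    """
    first, second = sorted([word_a, word_b])
    if p_alphabetical >= 0.5:
        return (first, second)
    return (second, first)


def condition(presented, p_alphabetical):
    """Label a presented (word1, word2) pair as preferred or non-preferred."""
    is_preferred = tuple(presented) == preferred_order(*presented, p_alphabetical)
    return "preferred" if is_preferred else "non-preferred"


# Invented prediction for one novel pair from the stimulus list.
print(condition(("bishops", "seamstresses"), 0.8))
print(condition(("seamstresses", "bishops"), 0.8))
```

The resulting two-level label is the kind of treatment-coded predictor entered into the region-by-region regressions.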
Attested expressions. For attested expressions, we consider two ways to sort expressions into preferred and non-preferred order: we can use corpus frequencies, replicating Siyanova-Chanturia et al. (2011), or we can use abstract knowledge model predictions. We begin by showing results with preferred/non-preferred determined by corpus frequencies. Results are shown in Fig. C.9. We find significant effects of order at the And, Word2, and Spill1 regions, with preferred read faster than non-preferred. Next we analyze our attested expressions as sorted by abstract knowledge model predictions. Results are shown in Fig. C.10. We find a significant effect of order at the Spill1 region, with preferred read faster than non-preferred.

Table C.7. Means, standard errors, and t values for the estimated coefficient of the preferred/dispreferred predictor in the region-by-region analyses of the self-paced reading experiment. t values greater than 2 are taken to be significant.
Discussion. We replicate Siyanova-Chanturia et al.'s (2011) finding that attested binomial expressions are read faster in their preferred order. We also demonstrate for the first time that novel binomials show online effects of abstract constraints on ordering, with faster reading times in our model's predicted preferred direction.
We do not present a region-by-region version of the multiple regression analyses presented in Section 5.2.2 because we do not expect the results seen there to hold at each region individually. As noted in Section 5.2.2, the analyses presented there took advantage of the fact that within the six-word region analyzed, participants read the same set of words regardless of order of binomial presentation. Within the word-by-word analyses presented here, however, words differ across conditions: Word1 in the preferred condition becomes Word2 in the dispreferred condition, and vice versa (e.g. "bishops and seamstresses" versus "seamstresses and bishops"). Moreover, recall that effects of lexical frequency are one component of abstract knowledge (Section 2), such that binomials in preferred order on average have a more frequent word preceding a less frequent word, while binomials in dispreferred order on average have a less frequent word preceding a more frequent word. Thus, on the basis of lexical frequency alone, we would expect to see the preferred order read faster around Word1 (or shortly thereafter, due to spillover), and the dispreferred order read faster around Word2 (or shortly thereafter). In other words, on the basis of lexical frequency alone, we would expect to see a local reversal of the effect of abstract knowledge around Word2 (although we expect this reversal to be smaller in magnitude than the overall benefit of conforming to abstract knowledge across the binomial as a whole). This prediction is borne out numerically in the Spill1 region for novel binomials, although it does not approach significance.

Appendix D. Experiment 2 results with raw reading times
Here we replicate the analyses presented in Section 5.2.2 with raw rather than word-length-residualized reading times. Model results are given in Table D.8. Crucial effects are very similar to those seen in Table 5. In a likelihood ratio test comparing this model to a model with only an additive (non-nested) effect of abstract knowledge, we find a marginally significant difference ($\chi^2(1) = 3.12$, $p = 0.08$). We attribute the lower significance level here compared to that presented in Section 5.2.2 to the presence of extra noise in the raw compared to the residualized reading time data.
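As a check on the arithmetic of the likelihood ratio test, the p-value for a chi-squared statistic with one degree of freedom can be computed in closed form. The sketch below uses the reported statistic of 3.12; the log-likelihood values are hypothetical, chosen only so that their difference yields that statistic, and the helper names are not from the original analysis.

```python
from math import erfc, sqrt


def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """Twice the log-likelihood gain of the fuller model over the reduced one."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_df1_pvalue(statistic):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    For df = 1 the statistic is a squared standard normal variable, so the
    tail probability equals erfc(sqrt(statistic / 2)).
    """
    return erfc(sqrt(statistic / 2.0))


# Hypothetical log-likelihoods whose difference reproduces the reported statistic.
statistic = likelihood_ratio_statistic(-100.0, -98.44)
print(round(statistic, 2))                   # 3.12
print(round(chi2_df1_pvalue(statistic), 2))  # 0.08
```

The one-degree-of-freedom test corresponds to the two models differing by a single parameter.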