A new test for exemplar theory: Varying versus non-varying words in Spanish

We used a Deese-Roediger-McDermott false memory paradigm to compare Spanish words in which the phonetic realization of /s/ can vary (word-medial positions: bu[s]to ~ bu[h]to ‘chest’, word-final positions: remo[s] ~ remo[h] ‘oars’) to words in which it cannot (word-initial positions: [s]opa ~ *[h]opa ‘soup’). At study, participants listened to lists of nine words that were phonological neighbors of an unheard critical item (e.g., popa, sepa, soja, etc. for the critical item sopa). At test, participants performed free recall and yes/no recognition tasks. Replicating previous work in this paradigm, results showed robust false memory effects: that is, participants were more likely to (falsely) remember a critical item than a random intrusion. When the realization of /s/ was consistent across conditions (Experiment 1), false memory rates for varying versus non-varying words did not significantly differ. However, when the realization of /s/ varied between [s] and [h] in those positions which allow it (Experiment 2), false recognition rates for varying words like busto were significantly higher than those for non-varying words like sopa. Assuming that higher false memory rates are indicative of greater lexical activation, we interpret these results to support the predictions of exemplar theory, which claims that words with heterogeneous versus homogeneous acoustic realizations should exhibit distinct patterns of activation.


Introduction
Exemplar theory's fundamental claim is that listeners store each instance of a word in memory, retaining details such as social context and talker identity (Pierrehumbert 2016). Previous research has provided substantial evidence for this claim, primarily by focusing on voices. For example, several studies have shown that listeners respond to words more quickly and accurately when they are spoken in a familiar voice, compared to an unfamiliar one (Church & Schacter 1994; Nygaard, Sommers & Pisoni 1994; Goldinger 1996; Bradlow, Nygaard & Pisoni 1999). These findings make sense only if listeners retain voice information, and are at odds with traditional, abstract notions of lexical representations in which phonetic details are discarded (e.g., Chomsky & Halle 1968). Given its emphasis on such details, one of the challenges for exemplar theory has been to provide a mechanism by which listeners can nevertheless make generalizations and recognize highly variable speech input as words. In several proposals, this is accomplished with clustering (K. Johnson 1997; Sumner et al. 2014; Pierrehumbert 2016). The basic idea is that exemplars which are similar to each other, such as those containing the same sequence of segments or those produced by the same talker, cluster together in acoustic space. During word recognition, stored exemplars become activated according to how acoustically similar they are to the speech input. Thus, for example, any stored exemplar will become activated if it is acoustically similar to an input such as [kæt], and will activate even more strongly if it contains similar voice details. Johnson (1997) conducted simulations to show that this approach can correctly classify speech input according to linguistic categories (such as vowels) as well as social categories (such as talker gender). The clustering mechanism thus predicts that listeners can successfully recognize the input [kæt] as the word cat, and that they will do so even better when it is produced by a familiar voice.
The clustering mechanisms of exemplar theory do not merely mimic the capacity for generalization that is inherent to abstractionist theories, but make their own unique predictions. As spelled out in the work of Goldinger (1998; see also Hintzman 1986), the first prediction is that the activation level for a heard word correlates with the number of matching exemplars that are stored in memory. Thus, the activation level (or "echo intensity") for a frequent word like cat should be high because [kæt] is acoustically similar to many stored exemplars, while the activation level for an infrequent word like fob should be low because [fɑb] is similar to only a few exemplars. The second prediction is that the activation content (that is, the identity of exemplars that are activated at any given moment) changes according to whether the matching exemplars are relatively heterogeneous versus homogeneous. Thus, the activation content (or "echo content") for a frequent word like cat should be relatively generic because its exemplars are heterogeneous, having been produced in many different contexts by many different voices; when they are all simultaneously activated, the result is a blend of these contexts and voices. Meanwhile, the activation content for an infrequent word like fob should be relatively specific because its exemplars are homogeneous, having been produced in only a few contexts by a few voices; when they are activated, the result is overwhelmingly faithful to these contexts and voices. In a series of shadowing experiments, Goldinger (1998) provided support for these predictions by showing that, for high-frequency words compared to low-frequency words, reaction times were faster (due to greater activation levels), but imitation of target voices was poorer (due to more generic activation content).
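The two notions of echo intensity and echo content can be illustrated with a minimal sketch in the spirit of Hintzman's (1986) MINERVA 2 model, on which Goldinger (1998) builds. The feature vectors below are hypothetical and chosen only for illustration: the first four features stand for word identity and the last four for voice details. The frequent word (CAT) is stored in sixteen different voices, the infrequent word (FOB) in only one voice, and the helper voice_fidelity is our own illustrative measure, not part of either author's model.

```python
from itertools import product

import numpy as np


def echo(probe, traces):
    """Return (echo intensity, echo content) for a probe, MINERVA 2 style.

    Assumes every feature is +1 or -1, so similarity is the dot product
    divided by the number of features.
    """
    probe = np.asarray(probe, dtype=float)
    traces = np.asarray(traces, dtype=float)
    similarity = traces @ probe / probe.size
    activation = similarity ** 3           # cubing preserves sign, sharpens matches
    intensity = float(activation.sum())    # "echo intensity"
    content = activation @ traces          # "echo content": activation-weighted blend
    return intensity, content


# Hypothetical feature codes (illustrative assumptions, not real acoustic data).
CAT = [1, 1, -1, -1]
FOB = [1, -1, 1, -1]
VOICE_A = [1, 1, 1, 1]

# Frequent word: one exemplar in each of 16 voices (heterogeneous).
# Infrequent word: two exemplars, both in VOICE_A (homogeneous).
all_voices = [list(v) for v in product([1, -1], repeat=4)]
memory = [CAT + v for v in all_voices] + [FOB + VOICE_A] * 2


def voice_fidelity(content):
    """Mean strength of the voice features after scaling the echo to max 1."""
    scaled = content / np.abs(content).max()
    return float(scaled[4:].mean())


cat_intensity, cat_content = echo(CAT + VOICE_A, memory)
fob_intensity, fob_content = echo(FOB + VOICE_A, memory)

print("intensity:", cat_intensity, fob_intensity)
print("voice fidelity:", voice_fidelity(cat_content), voice_fidelity(fob_content))
```

Under these assumptions, the frequent word yields the larger echo intensity, while the infrequent word yields an echo whose voice features are relatively stronger, which parallels the contrast between generic and specific activation content described above.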
In the current study, we broaden the investigation beyond voices, and test the unique predictions of exemplar theory by focusing instead on positional variation (Sumner et al. 2014). Positional variation occurs when speakers can optionally realize a segment in more than one way, but only when that segment occupies a particular position. One example comes from American English, where speakers can reduce voiceless coronal stops to glottalized realizations in word-final position, ba[t] ~ ba[ʔt̚] ~ ba[ʔ] (Deelman & Connine 2001; Sumner & Samuel 2005). Crucially, speakers do not reduce [t] in word-initial position, [tʰ]ip ~ *[ʔ]ip. Another example, and the focus of the current study, comes from "lowland" varieties of Spanish, where speakers can reduce voiceless alveolar fricatives to glottal fricatives or Ø in word-final position, remo[s] ~ remo[h] ~ remo[Ø] 'oars', and also in word-medial position before a consonant, bu[s]to ~ bu[h]to ~ bu[Ø]to 'chest' (Lipski 1984, 1994). Crucially, in most varieties of Spanish, speakers do not reduce [s] in word-initial position, [s]opa ~ *[h]opa ~ *opa 'soup' (for exceptions, see E. L. Brown & Cacoullos 2003; E. K. Brown & Brown 2012). The key point of interest is that exemplar theories, and in particular their clustering mechanisms, predict a fundamental difference between non-varying words like sopa and varying words like busto and remos. Specifically, the exemplars of sopa are relatively homogeneous by virtue of being realized with [s] alone; therefore, they should form a single cluster. Meanwhile, the exemplars of busto and remos are relatively heterogeneous because they are sometimes realized with [s] and sometimes with [h]; they should therefore form two overlapping but separate clusters, as depicted in Figure 1. (For simplicity, we omit consideration of [Ø] variants, but our overall predictions would be similar if we had included them; for related discussion see File-Muriel & Brown 2011).
Based upon the clustering depicted in Figure 1, exemplar theory makes two specific predictions that we test in the current study. The first prediction concerns situations in which speech input is restricted to [s] variants alone (here, this will be a laboratory situation). Given that the activation level of a word increases directly with the number of previously stored exemplars (Goldinger 1998), the theory predicts that speech input with [s] should activate a non-varying word like sopa more strongly than a varying word like busto or remos. This is because the input [s]opa activates roughly all of the exemplars for sopa, since all previously heard tokens of this word were realized with [s] and, therefore, all exemplars are acoustically similar to [sopa]. On the other hand, the input bu[s]to activates only a fraction of the exemplars for busto, since only a fraction of the previously heard tokens of the word were realized with [s] and, therefore, only a fraction of exemplars are acoustically similar to bu[s]to. The same argument holds for the input remo[s] and the word remos. This can be visualized in Figure 1, where [s]opa has many exemplars in its cluster, but bu[s]to and remo[s] have relatively few exemplars in their respective clusters.
The second prediction concerns situations in which speech input is unrestricted and contains both [s] and [h] variants (here, this will be a laboratory situation). Note that in such a situation, we no longer expect an advantage for sopa, because the number of activated exemplars becomes equivalent across the word types. That is, speech input consisting of both bu[s]to and bu[h]to would activate roughly all of the exemplars for busto (since these are the two principal variants of the word), just as input consisting of [s]opa activates roughly all of the exemplars for sopa (since this is the sole principal variant of the word). This can be visualized in Figure 1, where the [s]opa cluster contains as many exemplars as the bu[s]to plus bu[h]to clusters combined. The same logic applies to remo[s] plus remo[h]. Nevertheless, we still expect sopa to behave differently. Given that the activation content of a word changes according to the relative heterogeneity of its exemplars (Goldinger 1998), exemplar theory predicts that a non-varying word like sopa should produce a relatively specific activation pattern because its exemplars are homogeneous, while varying words like busto and remos should produce a relatively generic activation pattern because their exemplars are heterogeneous. Again, this can be visualized in Figure 1, where the single circle representing the [s]opa cluster takes up a smaller amount of acoustic space and represents a more specific pattern, compared to the two circles representing the bu[s]to plus bu[h]to clusters, which take up a larger amount of acoustic space and collectively represent a blended pattern. The same logic applies to the remo[s] plus remo[h] clusters. Previous research allows us to refine this prediction even further. In an experiment focusing on voices, input from multiple talkers (generic) produced stronger lexical activation than input from a single talker (specific) (Roediger et al. 2004). 
This suggests that, when word frequency is held constant, generic activation content contributes to overall higher activation levels. Adapting this finding to phonological variation, we predict that input from multiple variants, as in bu[s]to plus bu[h]to, should produce stronger lexical activation than input from a single variant, as in [s]opa. The exemplar diagrams in Figure 1 suggest that [h] realization occurs roughly 40% of the time. This is, of course, an over-simplification, because a large number of different factors have been shown to influence speakers' productions of [s] versus [h], including the variety of Spanish in question (Lipski 1984, 1994; Minnette Fox 2006) as well as the segment's position within the word (Terrell 1979; Lafford 1989; E. L. Brown & Cacoullos 2003; E. K. Brown & Brown 2012). For example, rates of [h] realization can be as high as 92% for speakers from Puerto Rico, and as low as 36% for speakers from Colombia; furthermore, for these and other documented varieties, rates differ for segments in medial compared to final position (E. K. Brown 2009). (Presumably, rates also differ from one word to the next but, to our knowledge, no study of Spanish has published the relevant statistics for individual lexical items; for statistics on variation in English, see Patterson & Connine 2001; Patterson, LoCasto & Connine 2003). Because these factors affect the relative proportion of [s] and [h] exemplars, we would also expect them to modulate the degree of activation difference between sopa words versus busto and remos words. Crucially, however, we do not expect them to alter our two basic predictions. 
For any listener of any Spanish variety who has heard almost exclusively [s] in word-initial position, but some combination of [s] and [h] in medial position as well as final position, we predict stronger activation of sopa words in [s]-restricted situations, but stronger activation of busto and remos words in unrestricted situations.
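The counting logic behind the first prediction, and behind the equivalence assumed in the second, can be made concrete with a deliberately simplified sketch. It assumes that each stored exemplar is either an [s] token or an [h] token, that [Ø] variants are ignored, and that an exemplar is activated only by input containing the same variant; real exemplar models use graded similarity instead. The function name and the all-or-nothing matching are illustrative assumptions; the [h] rates are the figures quoted above (roughly 40% in Figure 1, 92% for Puerto Rico, 36% for Colombia).

```python
def matched_share(h_rate, input_variants):
    """Share of a word's stored exemplars that match the input variants.

    h_rate: proportion of the word's exemplars realized with [h]
            (0.0 for a non-varying word like sopa).
    input_variants: variants present in the speech input, e.g. ["s"] or ["s", "h"].
    """
    shares = {"s": 1.0 - h_rate, "h": h_rate}
    return sum(shares[v] for v in set(input_variants))


# [s]-restricted input (as in Experiment 1): sopa matches fully, busto only partly.
for label, h_rate in [("sopa", 0.0), ("busto, Figure 1", 0.40),
                      ("busto, Puerto Rico", 0.92), ("busto, Colombia", 0.36)]:
    print(label, round(matched_share(h_rate, ["s"]), 2))

# Unrestricted input with both variants (as in Experiment 2): shares become equal.
print(matched_share(0.0, ["s"]), matched_share(0.40, ["s", "h"]))
```

The sketch captures only the number of matching exemplars. The predicted advantage for varying words under unrestricted input rests on activation content, which this count does not model.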
The two predictions that we have laid out are important because they are unique to exemplar theory (note that ideas related to "echo intensity" are developed in the work of Connine and colleagues, but their framework does not incorporate any notion of "echo content"; Ranbom & Connine 2007; Connine, Ranbom & Patterson 2008; Pinnow & Connine 2014), and distinguish it from other theories of variant recognition that retain the notion of a single, abstract representation for each word, such as inference models (Gaskell & Marslen-Wilson 1996, 1998; Pitt 2009; a related idea is investigated in Gow 2001) and underspecification models (Lahiri & Marslen-Wilson 1991; Lahiri & Reetz 2010). Testing these predictions, however, presents a methodological problem. While words like sopa, busto, and remos differ along multiple dimensions (which we discuss more fully in Section 2), the crucial difference for our purposes lies in the position of /s/. Unfortunately, segment position is known to exert a large influence in most word recognition tasks, for reasons unrelated to positional variability. Initial segments in particular play a disproportionately important role in activating representations, compared to medial or final segments (e.g., Marslen-Wilson & Welsh 1978; Nooteboom 1981; Gaskell & Marslen-Wilson 1997; and many others). Moreover, word-initial privilege makes one of the same predictions that exemplar theories make, namely that an input containing [s] should activate words like sopa, where [s] is initial, more strongly than words like busto or remos, where [s] is medial or final. Thus, positional effects obtained in most tasks would confound any claims about differences between varying versus non-varying words.
To address this problem, the current study uses a Deese-Roediger-McDermott (DRM) false memory paradigm (seminal papers include Deese 1959;Roediger & McDermott 1995; for reviews, see Gallo 2006;2010). In this paradigm, participants see or hear lists of words that are associates or neighbors of a critical item, and subsequently try to remember those words. The key result is that participants often (falsely) remember the critical item, even though they never heard it. For example, the words rack, pack, bake, book, bag, bat, etc., are all phonological neighbors of the critical item back, differing from it by the substitution of one phoneme. After hearing such a list, listeners falsely remembered the unheard word back on average 65 to 70% of the time (Sommers & Lewis 1999). Several studies have reported similar results, for both serial recall and yes/no recognition tasks, and established the robustness of false phonological memories (Wallace, Stewart & Malone 1995;Schacter, Verfaellie & Anes 1997;Wallace et al. 1998;Wallace et al. 2001;Westbury, Buchanan & Brown 2002;Watson, Balota & Roediger 2003;Amberg, Yamashita & Wallace 2004;Garoff-Eaton, Kensinger & Schacter 2007;Ballardini, Yamashita & Wallace 2008;Ballou & Sommers 2008). In our study, we asked participants to listen to lists of neighbors such as popa, sepa, soja, etc. ('stern', 'I know (Subjunctive)', 'soy', respectively; for the critical item sopa), gusto, vasto, bulto, etc. ('taste', 'vast', 'bundle'; for the critical item busto), and demos, ramos, retos, etc. ('demonstrations', 'branches', 'challenges'; for the critical item remos). Our key question was whether rates of false recall and/or recognition differed for critical items such as sopa versus busto and remos.
Two aspects of the DRM false memory paradigm are crucial for our current goal: first, we can interpret false memory rates as reflecting lexical activation of the critical item, and second, both initial and non-initial segments contribute to this activation in an equivalent manner. Both aspects are explained by the concept of converging neighborhood activation (for a different view of false memories, see Kroll et al. 1996; Reinitz 2001). Previous work has demonstrated that when a listener hears speech input such as [popa] 'stern', she activates the representation for the target popa, but also the representations for phonological neighbors that differ in the substitution of one phoneme, such as those for sopa, papa, poca ('soup', 'potato', 'little (Fem)') and so on (Goldinger, Luce & Pisoni 1989; Goldinger et al. 1989; Luce & Pisoni 1998; Luce et al. 2000; Stockall, Stringfellow & Marantz 2004; Magnuson et al. 2007). In the DRM false memory paradigm, such activation occurs repeatedly, and eventually converges on the single word which is a neighbor to all of the items on a given list, namely the critical item (sopa). It is this converging activation which creates the experience of having heard a word (Collins & Loftus 1975; Sommers & Lewis 1999; Roediger, Balota & Watson 2001).
Additional support for this notion comes from studies demonstrating that reduced amounts of converging activation result in lower rates of false memories. Sommers and Lewis (1999), for example, manipulated the similarity of list words to the critical item. Neighbors like rack, pack, bake, book, bag, bat, etc., are strongly associated with the critical item back (because their consonants and vowels are highly confusable with [b], [æ], [k]) and, as noted above, produced high rates of false remembering. By contrast, neighbors like shack, yak, ban, batch, beak, bike, etc. are only weakly associated with back (because their consonants and vowels are less confusable with [b], [æ], [k]), and produced significantly lower rates of false remembering. Lowering the amount of converging activation by other methods, e.g., by decreasing the number of heard associates on a list, also results in lower rates of false memories (Robinson & Roediger 1997), showing that we can interpret rates of false recall and/or recognition as reflecting levels of spreading activation for the representation of the critical item.
Furthermore, the critical item in the DRM false memory paradigm receives equivalent activation from initial and non-initial segments. This is because the list of heard words includes neighbors that differ from the critical item by a single segment, and the position of this segment varies. For example, the critical item sopa [sopa] receives activation from spoken popa [popa] (initial C substitution), sepa [sepa] (medial V substitution), and soja [soxa] (medial C substitution). Analogously, the critical item busto [busto] receives activation from spoken gusto [gusto] (initial C substitution), vasto [basto] (medial V substitution), and bulto [bulto] (medial C substitution). Because activation occurs indirectly from diverse neighbors, rather than directly from the speech signal, initial segments do not disproportionately affect activation compared to non-initial segments. Indeed, Westbury, Buchanan and Brown (2002) showed that lists of English phonological neighbors with initial CV overlap (bade, bane, beige, etc.) and lists with final VC overlap (make, wake, sake, etc.) produced equivalent false memory rates for the critical item (bake). Thus, a key advantage of the false memory approach is that we can attribute any differences between false memories for sopa versus busto and remos to their non-varying versus varying status, rather than to task-dependent asymmetries in the role played by [s] in initial position.
Although the basic predictions outlined above could be investigated in any number of languages with positional variation, Spanish offers a particularly interesting test case for exemplar theory because of the diversity of factors that have been shown to influence speakers' productions of [s] versus [h]. In addition to those factors we have already mentioned, gender plays a role (Ma & Herasimchuk 1971; Mack 2011), as does the contextual environment for disambiguation (Poplack 1980; Hundley 1987). Since theories with abstract representations, such as inference and underspecification models, do not have the mechanisms in place to incorporate these non-linguistic factors, some authors have already argued for an exemplar-based theory of the lexicon specifically by invoking Spanish production data (Bybee 2000; E. K. Brown 2009) (Hammond 1978; Widdison 1995; Figueroa 2000; Boomershine 2006; Mack 2011; Carlson 2012; Schmidt 2013, 2015). One of these studies employs an exemplar framework: Boomershine (2006)

Experiment 1: Restricted speech input with [s] variants alone
Experiment 1 compared false recall and recognition rates for critical items with initial, medial, or final /s/. The neighbors for these critical items were all pronounced with [s]. The prediction is that false recall and recognition rates should be higher for critical items with /s/ in initial position, compared to other positions.

Method
As critical items, we selected eighteen Spanish words. As shown in Table 1, six of these words contained /s/ in initial position, six contained /s/ in medial position before a consonant, and six contained /s/ in final position. All critical items were bisyllabic, with at least nine phonological neighbors, and all were either nouns or adjectives.
As will be apparent, while the critical items in each condition crucially differed in the position of /s/, they also differed in syllable structure as well as morphological composition. For syllable structure, the critical items with initial /s/ use CV.CV, those with medial /s/ use (C)VC.CV, and those with final /s/ use (C)V.(C)VC. To a large extent, these differences arise directly from the phonological environments that prohibit variation between [s] and [h] (syllable-initial) versus those that license it (syllable-final). The solution to this problem would have been to impose a maximal CVC.(C)VC structure on all critical items, but this was not feasible. Spanish has a highly restricted inventory of consonants in the syllable coda, and only /s, l, ɾ, d/ and nasals are commonly attested in this position (Hualde 2005: 76). Thus, the number of Spanish words with CVC.(C)VC structure is relatively small to begin with; of those that do exist, many are formed with the addition of plural -s (e.g., sopas 'soups', bustos 'chests'), an unsuitable addition to critical items that already contained /s/ in a different position. In addition, every critical item required a minimum of nine phonological neighbors, and only a handful of CVC.(C)VC words in Spanish meet this criterion. Thus, constraints on the Spanish lexicon precluded the imposition of a uniform syllable structure. For morphological composition, the critical items in the initial and medial /s/ conditions are singular forms, while those in the final /s/ condition are plural forms. Again, this is due to constraints on the lexicon: very few Spanish words end with a non-plural /s/.
Previous work in the DRM false memory paradigm suggests that differences in syllable and morphological structure should not affect our results. Two studies have used critical items with a variety of syllable structures, such as CVC (dog, face, fat, etc.), CCVC (flag, glass, sleep, etc.), and CVCC (hand, cold, hard, etc.) (Watson et al. 2003), and reported false recognition results comparable to those for studies in which the critical items were exclusively CVC, such as Sommers and Lewis (1999). A third study used critical items containing both simple onsets as well as onset clusters, and also reported comparable results (Schacter et al. 1997: 337). In the morphological domain, Pycha (2017) compared critical items that were either morphologically simple (rise, fade, etc.) or morphologically complex (lies, paid, etc.) and reported no differences between these two conditions. While more research is needed to investigate these issues further, it would appear that neither syllable structure nor morphological structure exerts demonstrable effects on false memory rates.
The lexical statistics for the eighteen critical items are displayed in Table 2. Although our intent was to balance frequency, number of neighbors, and neighborhood frequency across the three conditions, constraints on the Spanish lexicon prevented us from doing so. The language simply does not contain a large number of words that meet the basic criteria for being a critical item (contains /s/, bisyllabic, noun or adjective, at least nine neighbors), and thus did not provide the option of selecting words as a function of their lexical characteristics. In order to control for their influence, we used a composite measure, the frequency-weighted neighborhood probability rule (FW-NPR), which reflects all three of these characteristics, and included it as a factor in our statistical analysis. 
Adapted from Luce and Pisoni (1998: 12), we calculated the FW-NPR as (Frequency of target) divided by the sum of (Frequency of target) and (Summed frequencies of neighbors), using log frequencies throughout. For example, sopa has a raw frequency of 25.82 per million words, and therefore a log frequency of 1.43. Moreover, sopa has eleven neighbors, whose log frequencies sum to 18.67. Therefore, the FW-NPR for sopa is (1.43)/(1.43 + 18.67) = 0.07. Luce and Pisoni (1998) show that FW-NPR values for American English words correlate closely with the outcomes of word recognition studies, suggesting that this metric serves as a valid predictor of the effects of lexical characteristics on activation of word representations.
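The FW-NPR calculation for sopa can be reproduced with a short sketch. The function names are our own, and the log transform follows the log10(frequency + 1) convention reported for the lexical statistics in this study. Because the individual neighbor frequencies for sopa are not listed in the text, the sketch passes their reported sum (18.67) as a single value; this is a placeholder, not real per-neighbor data.

```python
import math


def log_frequency(per_million):
    """Log-transform a raw frequency (per million words) as log10(frequency + 1)."""
    return math.log10(per_million + 1)


def fw_npr(target_log_freq, neighbor_log_freqs):
    """Frequency-weighted neighborhood probability rule.

    Target log frequency divided by the sum of the target log frequency
    and the summed log frequencies of its neighbors.
    """
    return target_log_freq / (target_log_freq + sum(neighbor_log_freqs))


sopa_log_freq = round(log_frequency(25.82), 2)   # raw frequency taken from the text
sopa_fw_npr = fw_npr(sopa_log_freq, [18.67])     # 18.67 = reported sum over eleven neighbors

print(sopa_log_freq, round(sopa_fw_npr, 2))
```

A word with few or low-frequency neighbors receives a value closer to 1, while a word in a dense, high-frequency neighborhood receives a value closer to 0.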
For each critical item, we constructed a list of nine phonological neighbors that differed by the addition, deletion, or substitution of a single phoneme. Sample lists are displayed in Table 3, and the complete list of stimuli is in the Appendix.
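The neighbor definition used here (a difference of exactly one phoneme by addition, deletion, or substitution) can be checked mechanically, as in the following sketch. The function is illustrative and is not the procedure used to build the stimulus lists. It assumes that each character in a string stands for one phoneme, so words are given in broad transcription (e.g., soja as "soxa", vasto as "basto", hemos as "emos").

```python
def is_neighbor(a, b):
    """True if phoneme sequences a and b differ by exactly one
    addition, deletion, or substitution of a single phoneme."""
    a, b = list(a), list(b)
    if a == b:
        return False
    if len(a) == len(b):
        # Substitution: exactly one position differs.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    # Addition or deletion: removing one phoneme from the longer form
    # must yield the shorter form.
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


# Broad transcriptions, one character per phoneme (an assumption of this sketch).
print(is_neighbor("sopa", "popa"))    # initial consonant substitution
print(is_neighbor("sopa", "soxa"))    # medial consonant substitution (soja)
print(is_neighbor("remos", "emos"))   # deletion (hemos, with silent h)
print(is_neighbor("busto", "gasto"))  # two substitutions, so not a neighbor
```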
Note that in Spanish orthography, the letters b and v at the beginning of a word indicate the same sound, namely [b]. Thus, busto, vasto, and visto all begin with [b]. The letter h is silent, as in hemos [ˈe.mos] (Hualde 2005: 7). Accent marks indicate stress, e.g., gustó [gus.ˈto] has stress on the final syllable. In words with no accent mark and an open final syllable, stress generally falls on the penultimate syllable, e.g., gusto [ˈgus.to] 'he/she/you liked'. In words with no accent mark and a closed final syllable, stress generally falls on the final syllable except in plural words, where stress falls on the same syllable as in the singular form, e.g., demo [ˈde.mo], demos [ˈde.mos] (Hualde 2005: 222-223).
As is apparent in Table 3, all neighbors occurred as isolated words. This allowed us to adhere to the definition of "neighbor" as a word that differs by a single phoneme, which is a crucial feature of the DRM false memory tasks. However, this strategy also presented some

Table 2: Lexical statistics (means and standard deviations) for critical items used in the design of Experiments 1 and 2, from the Clearpond database (Marian et al. 2012). Frequencies are reported as log10(frequency+1). Neighbors is a count of words that differ from the critical item by the addition, deletion, or substitution of one phoneme. Neighbor frequency is the mean frequency of all neighbors of a critical item. FW-NPR is the frequency-weighted neighborhood probability rule, adapted from Luce and Pisoni (1998: 12); see text.

(Hualde 2005: 87, 97-98). Although researchers have offered evidence that an individual exemplar may indeed contain more than one word (Bybee 2001, 2002), we are not aware of any study that has examined the perceptual consequences for the individual words involved. Thus, while it is possible that a phrase such as pocos amigos could form a single exemplar, particularly if it occurs frequently, we simply do not yet know how this impacts the storage of the individual words pocos and amigos (similarly for phrases such as los sacos). Future research may shed light on this issue. In the meantime, we note that our use of isolated speech did help to avoid potential ambiguities in perception. As Hualde (2005: 87) points out, resyllabification of /s/ creates homophony between phrases such as las alas [la.sa.las] 'wings' and la salas [la.sa.las] 'you salt it', such that the listener may not always know which form the speaker intended. In the context of the current study, no such ambiguity was possible. The lexical characteristics for the 162 neighbors are displayed in Table 4. 
As with the critical items, our intent was to balance these characteristics across the three conditions, but constraints on the Spanish lexicon prevented us from doing so. Following the reasoning described above, then, we also included the mean FW-NPR for neighbors in the statistical analysis of our results.

To create the fillers, we selected twelve critical items that did not contain /s/, and created a list of nine phonological neighbors for each, yielding a total of 108 words.

Recording
The speaker for the recording was an educated, middle-aged, female native speaker of the Puerto Rican variety of Spanish, living and working in the United States, and unaware of the purpose of the experiment. She recorded the list words as well as the critical items for

Procedure
The thirty lists of phonological neighbors (eighteen target lists plus twelve filler lists) were divided into two sets of fifteen (A and B), each containing three lists from the initial /s/ condition, three from the medial /s/ condition, three from the final /s/ condition, and six from the fillers. Each participant was randomly assigned to set A or set B. During the experiment, participants were seated in a quiet setting in front of a computer equipped with a mouse, keyboard, and high-quality headphones. Spanish-speaking research assistants (one who spoke a Mexican variety without [s] ~ [h] variation, and another who spoke a Colombian variety without [s] ~ [h] variation) greeted the participants and explained the procedure in Spanish. Printed instructions on the computer screen, also in Spanish, guided participants through each step.
In the study phase, participants listened to fifteen lists of nine spoken words. Each word on a list was played individually, followed by 1 second of silence before the onset of the next word. After each list, participants did a free recall task, in which they were given 45 seconds to type as many words as they could remember from the list, in any order. After 45 seconds, they proceeded to the next list. The overall order of the fifteen lists, as well as the order of the nine words within each list, was randomized for each participant.
In the test phase, after listening to all fifteen lists, participants did a recognition task in which they listened to an individual spoken word, and made a yes/no judgment as to whether they had heard the word previously in the experiment. There were 92 items in the recognition task, which included 45 "old" words that the participant had actually heard (three from each of the fifteen studied lists), plus 47 "new" words that the participant had not heard. The unheard words included the fifteen critical items from the participant's own set. In addition, the unheard words included 32 foils, which were created by using words from the opposite set (that is, for participants assigned to set A, the foils came from set B, and vice versa). The foils consisted of eight critical items from the other set (one from each of eight unheard lists, which included two lists from the initial /s/ condition, two from the medial /s/ condition, two from the final /s/ condition, and two fillers), and twenty-four neighbor words from the other set (three from each of the eight unheard lists, again with equal numbers from each condition). As with the heard lists, critical items in the recognition task were pronounced with [s], e.g., [s]opa, bu[s]to, remo[s]. The order of items in the recognition task was randomized for each participant.
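The composition of the recognition task can be verified with simple arithmetic, sketched below. The function and its parameter names are hypothetical; the default values are the counts stated in the text.

```python
def recognition_test_counts(studied_lists=15, old_per_list=3,
                            unheard_lists=8, foil_neighbors_per_list=3):
    """Tally the item types in the yes/no recognition task."""
    old = studied_lists * old_per_list                       # heard list words
    own_critical = studied_lists                             # one unheard critical item per studied list
    foil_critical = unheard_lists                            # critical items from the other set
    foil_neighbors = unheard_lists * foil_neighbors_per_list # list words from the other set
    foils = foil_critical + foil_neighbors
    new = own_critical + foils
    return {
        "old": old,
        "own_critical": own_critical,
        "foils": foils,
        "new": new,
        "total": old + new,
    }


counts = recognition_test_counts()
print(counts)
```

With the stated design, the tally gives 45 old words and 47 new words (15 critical items plus 32 foils), for 92 items in total, matching the figures reported above.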

Setting: Multiple varieties of Spanish in contact
Experiments 1 and 2 both took place on a university campus in a small city in the midwestern region of the United States. This setting placed constraints on our participant pool, because only a limited number of Spanish speakers were available to complete the experimental tasks in our on-campus laboratory. As such, it was not feasible to limit recruitment to speakers from a single region of the world. On the other hand, this setting also offered an advantage, because the Spanish-speaking population in the U.S. is notably diverse, representing over twenty different countries of origin. While the majority [...]
Of course, Spanish speakers living in the U.S. also have exposure to American English. It seems unlikely, however, that this fact affects our basic predictions, which are based upon the premise that similar-sounding exemplars cluster together. Spanish and English have different lexicons, so exemplars from these two languages are not typically similar-sounding (e.g., Spanish [remos] sounds totally different from English [oʷɹz]). A possible exception concerns cognate words like Spanish sopa [sopa] 'soup' and English soup [sup], but even in these cases, segmental differences are present, and these would seem to militate against storing such exemplars in the same cluster.
To address the diversity of Spanish varieties within our participant pool, we formulated a questionnaire that asked each participant to indicate their age, gender, country of origin (or if they were born in the U.S., the country of their parents' origin), and number of years in the U.S. The questionnaire also provided simple examples of [s] ~ [h] variation, and asked participants to answer a multiple-choice question indicating whether they a) spoke a variety of Spanish with that type of variation, b) could easily understand such a variety, even if they did not speak it themselves, or c) had difficulty understanding such varieties. The questionnaire was written in Spanish, and participants completed it after the experimental tasks. During data analysis, questionnaire data allowed us to consider the extent to which country of origin affected our results.

Participants in Experiment 1
Forty-nine native speakers of Spanish served as participants in Experiment 1. A large majority (38) had Mexican origins (sixteen of these had grown up in Mexico, and twenty-two had grown up in the United States with parents of Mexican origin). In addition, two participants were from Colombia, two were from the United States with parents of Puerto Rican origin, one was from Chile, one from Panama, and one from Spain. Four participants did not state their country of origin.
On the questionnaire, thirty-one of the participants indicated that they could easily understand varieties of Spanish with [s] ~ [h] variation, although they did not speak such a variety themselves. Four participants spoke such a variety themselves, while fourteen indicated they had difficulty understanding such varieties. Thirty-four participants were female, and fifteen were male. Their ages ranged from 20 to 55, with a mean of 32 (SD = 11.90). They were all living in the United States at the time of the study, and spoke English in addition to Spanish. The experiment took approximately forty-five minutes of each participant's time, and they received cash compensation.

Recall results for Experiment 1
The recall task yielded a total of 4,127 responses. Thus, on average, listeners responded with 5.61 items per list (=4,127/(15 lists per participant *49 participants)). Once the fillers were removed, there were 2,427 target responses. Participants sometimes typed the same word twice for one list, resulting in 206 duplicates (68 in the initial /s/ condition, 72 in the medial /s/ condition, and 66 in the final /s/ condition), which we removed.
During data analysis, we attended to the fact that Spanish orthography has several instances in which different letters or letter combinations map to the same spoken sound. Both b and v indicate [b] (when initial) or [β] (when medial), so we treated these interchangeably and accepted e.g., cavo for intended cabo 'cape', as well as labo for intended lavo 'I wash'. Both ll and y indicate [ʎ] (or alternatively [ʝ], depending upon dialect), so we treated these interchangeably and accepted e.g., caya for intended calla 'he/she/you shuts up', as well as cullo for intended cuyo 'whose'. Both ce and s indicate [s], so we treated these interchangeably and accepted e.g., caceta for intended caseta 'stand', as well as sesto for intended cesto 'basket'. Finally, orthographic h has no corresponding phonetic realization, so we accepted e.g., arta for intended harta 'full'.
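One way to apply these spelling equivalences consistently is to map each response and each intended word to a common normalized form before comparing them. The sketch below is a hypothetical illustration, not the procedure actually used in the study: the function names are our own, extending the c/s rule to c before i is our assumption (the text mentions only ce), and keeping h inside the digraph ch is likewise our assumption.

```python
import re

def normalize_spelling(word: str) -> str:
    """Map a typed word to a form that ignores the spelling alternations above."""
    w = word.lower().strip()
    w = w.replace("v", "b")           # b and v treated interchangeably
    w = w.replace("ll", "y")          # ll and y treated interchangeably
    w = re.sub(r"c(?=[ei])", "s", w)  # c before e (and, by assumption, i) treated as s
    w = re.sub(r"(?<!c)h", "", w)     # silent h dropped; h in the digraph ch is kept
    return w

def same_word(response: str, intended: str) -> bool:
    """True if the two spellings are equivalent under the rules above."""
    return normalize_spelling(response) == normalize_spelling(intended)

# Each pair is (typed response, intended word) taken from the examples in the text.
for response, intended in [("cavo", "cabo"), ("labo", "lavo"), ("caya", "calla"),
                           ("cullo", "cuyo"), ("caceta", "caseta"),
                           ("sesto", "cesto"), ("arta", "harta")]:
    print(response, intended, same_word(response, intended))
```

A normalization of this kind does not handle accent marks or genuine typos, which would still need to be checked by hand.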
Apart from these alternative spellings, there were 12 typos that did not produce a Spanish word (2 in initial /s/, 6 in medial /s/, and 4 in final /s/), which we removed. In addition, there were four responses that did not contain appropriate coding, due to experimenter error. After removal of fillers, duplicates, typos, and errors, 2,205 responses were included in the final analysis.
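The response counts reported for the recall task can be reproduced directly from the figures given above. The following is a minimal sketch of that bookkeeping; the function names are hypothetical.

```python
def mean_responses_per_list(total_responses: int, lists_per_participant: int,
                            participants: int) -> float:
    """Average number of recall responses per list."""
    return total_responses / (lists_per_participant * participants)

def remaining_after_exclusions(target_responses: int, *excluded: int) -> int:
    """Target responses left once each exclusion count is subtracted."""
    return target_responses - sum(excluded)

# Experiment 1 figures as reported in the text.
mean_per_list = mean_responses_per_list(4127, 15, 49)
duplicates = 68 + 72 + 66   # initial, medial, final /s/
typos = 2 + 6 + 4           # initial, medial, final /s/
coding_errors = 4
final_n = remaining_after_exclusions(2427, duplicates, typos, coding_errors)

print(round(mean_per_list, 2), duplicates, typos, final_n)
```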
Following the methodology used in previous studies on false recall (Roediger & McDermott 1995; Sommers & Lewis 1999), we classified a response as "veridical" if it corresponded to a word that actually occurred on the list (e.g., popa, sepa, soja, etc.), "intrusion" if it did not occur on the list (e.g., random intrusions such as buena 'good (Fem)', cosa 'thing', etc.), and "critical item" if it corresponded to the critical item (e.g., sopa). To calculate veridical proportions, again following previously established methodology, we divided the number of veridical responses per list by nine, which was the number of words that actually occurred on each list. For example, if a participant provided four veridical responses for a list, the proportion of veridical responses would be 0.44 = 4/9. For intrusion proportions, we divided the number of intrusion responses by the total number of responses per list. For example, if a participant provided one intrusion response for a list, plus four veridical responses and one critical item, the proportion of intrusion responses would be 0.17 = 1/(1 intrusion + 4 veridical + 1 critical item). Finally, the critical item proportion was calculated as 0 if the participant did not respond with the critical item, and 1 if they did. The mean proportion of veridical, intrusion, and critical item responses given across the three conditions is shown in Table 5.
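The three proportions can be illustrated with a short function that reproduces the worked examples above. This is a sketch under our own naming; returning 0 for the intrusion proportion when a participant gave no responses at all is an assumption, since the text does not address that case.

```python
def recall_proportions(n_veridical: int, n_intrusion: int,
                       recalled_critical: bool, list_length: int = 9):
    """Return (veridical, intrusion, critical item) proportions for one list."""
    total_responses = n_veridical + n_intrusion + int(recalled_critical)
    veridical = n_veridical / list_length
    # Assumption: an empty response set yields an intrusion proportion of 0.
    intrusion = n_intrusion / total_responses if total_responses else 0.0
    critical = 1 if recalled_critical else 0
    return veridical, intrusion, critical

# Worked example from the text: 4 veridical, 1 intrusion, critical item recalled.
veridical, intrusion, critical = recall_proportions(4, 1, True)
print(round(veridical, 2), round(intrusion, 2), critical)
```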
To analyze the results, we used a mixed logit model in which we counted each response as a "success" and each possible lack of response as a "failure", following Jaeger (2008). For example, if a participant provided five veridical responses for a list, we counted five successes plus four failures, where failures indicate the four words that the participant heard but did not recall. If a participant provided two intrusion responses, we counted two successes, and the remaining responses (veridical plus critical item) as failures. If a participant provided the critical item as a response, we counted one success, and zero failures. Although many previous studies on the DRM false memory paradigm analyze results with ANOVA, these traditional tests pose documented problems for categorical outcome variables (Jaeger 2008), and also do not allow the inclusion of random effects.
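The success/failure coding described above can be sketched as follows. The function name and output layout are hypothetical, and coding a non-recalled critical item as zero successes and one failure is our assumption; the text states only the case in which the critical item was recalled.

```python
def success_failure_counts(n_veridical: int, n_intrusion: int,
                           recalled_critical: bool, list_length: int = 9) -> dict:
    """Return (successes, failures) per response type for one participant and list."""
    return {
        # Heard words that were recalled vs. heard words that were not recalled.
        "veridical": (n_veridical, list_length - n_veridical),
        # Intrusions vs. the remaining responses (veridical plus critical item).
        "intrusion": (n_intrusion, n_veridical + int(recalled_critical)),
        # Assumption: a critical item that was not recalled counts as one failure.
        "critical": (1, 0) if recalled_critical else (0, 1),
    }

# Example from the text: five veridical responses give five successes, four failures.
counts = success_failure_counts(n_veridical=5, n_intrusion=2, recalled_critical=True)
print(counts)
```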
We used the glmer() function from the lme4 package in R, with predictor variables of response type (Veridical vs. Intrusion vs. Critical item) and condition (Initial /s/ vs. Medial /s/ vs. Final /s/), and with two random intercepts: one for the critical item's FW-NPR and one for the mean FW-NPR of corresponding neighbors on the list. For example, following the calculations described in Section 2.1, the FW-NPR of the critical item busto is 0.03, and this was used as one random intercept. The nine heard neighbors on the busto list (gusto, vasto, bulto, etc.) have a mean FW-NPR of 0.11, and this was used as another random intercept. Models that also included random intercepts for participant failed to converge. We used treatment coding. "Initial /s/" served as the baseline for position, because previous studies on phonological false memories have focused almost exclusively on critical items of this type (i.e., non-varying words), and our key motivation was to test how "Medial /s/" and "Final /s/" would deviate from this baseline. "Critical item" served as the baseline for response type, because this allowed us to test for a basic false memory effect, in which we expect a higher rate of responses to critical items compared to random intrusions. This baseline also allowed us to make our crucial comparison between Initial /s/ versus Medial and Final /s/ specifically for recall of critical items (rather than for recall of veridical items or intrusions). Results of the model are displayed in Table 6.
The model shows that participants exhibited a significant false memory effect, but that the predicted differences between words with initial /s/ versus those with medial or final /s/, while trending in the expected direction, did not reach significance. The false memory effect is apparent because the odds of a response decreased in the Intrusion condition compared to the Critical Item baseline, by a factor of approximately 0.58 ($= e^{-0.55}$). In other words, participants were less likely to recall a random intrusion such as buena 'good (Fem)' compared to a critical item such as sopa. Also, as expected, the odds of a response increased in the Veridical condition compared to the Critical Item baseline, by a factor of approximately 1.86 ($= e^{0.62}$). That is, participants were more likely to recall a word they had actually heard, such as popa, sepa, or soja, than a critical item they had not heard. Rates of recall for medial /s/ and final /s/ did not differ significantly from the baseline; that is, we found no reliable difference in how often participants falsely recalled critical items such as sopa, busto, and remos. The model also indicated one interaction. Compared to the baseline, the odds of a response increased in the Final /s/ Intrusion condition by a factor of approximately 2.08 ($= e^{0.73}$).
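The odds factors reported here are obtained by exponentiating the model's log-odds coefficients. As a minimal sketch, using only the coefficients quoted in the text (the dictionary labels are our own shorthand, not the model's term names):

```python
import math

# Log-odds coefficients quoted in the text for the Experiment 1 recall model.
coefficients = {
    "Intrusion": -0.55,
    "Veridical": 0.62,
    "Final /s/ x Intrusion": 0.73,
}

# exp(coefficient) gives the multiplicative change in odds relative to the baseline.
odds_factors = {term: round(math.exp(b), 2) for term, b in coefficients.items()}
print(odds_factors)
```

A factor below 1 indicates lower odds than the baseline, and a factor above 1 indicates higher odds.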
A separate analysis that included only the thirty-eight participants of Mexican origin yielded an identical pattern of results. A separate analysis that included only participants who could speak or easily understand [s] ~ [h] varieties also yielded an identical pattern of results. As in the main model, both of these separate analyses revealed decreased odds in the Intrusion condition, increased odds in the Veridical condition, and an interaction in the Final /s/ Intrusion condition. Both analyses showed a trend toward greater false recall in the Initial /s/ condition compared to Medial and Final /s/ conditions, but the trend did not reach significance in either analysis.

Recognition results for Experiment 1
The recognition task yielded 4,508 responses for analysis (=92 items * 49 participants). As with the recall task, "veridical" items actually occurred on a list while "critical items" were the unheard words used to construct the lists. "Intrusion" items did not occur on any list that the participant heard; unlike in the recall task, these occurred as foils drawn from the experimental set that the participant was not assigned to, as described in Section 2.3. To calculate veridical proportions, we divided the number of "yes" responses by the total number of items of this type. For example, participants gave judgments to three words from each of the fifteen lists that they heard previously at study; if they responded "yes" to two of these, their veridical proportion for this list was 0.67 (=2/3). Participants also gave judgments to four intrusions from each of eight lists that they did not hear; if they responded "yes" to one of these, their intrusion proportion was 0.25 (=1/4). Finally, participants gave judgments to one critical item from each of fifteen lists that they heard previously at study; if they responded "yes" to it, their critical item proportion was 1 (=1/1). The mean proportion of "yes" responses given to veridical words, intrusions, and critical items is shown in Table 7.
We analyzed the results using a mixed logit model similar to that used for the recall task. We counted each "yes" response as a success and each "no" response as a failure. Unfortunately, models with a full interaction between item type (Veridical vs. Intrusion vs. Critical item) and condition (Initial /s/ vs. Medial /s/ vs. Final /s/) failed to converge. Therefore, we ran an analysis on a subset of the data that included critical item responses only (excluding veridical responses and intrusions), with a random intercept for critical item FW-NPR. (The list FW-NPR, which calculates an aggregate reflecting a list of words as a whole, is not relevant for the recognition task, in which participants are cued with individual spoken words rather than being asked to perform free recall from an entire list). We used treatment coding such that "Initial /s/" served as the baseline for position. Table 8 displays the results of the model.
Results show that rates of recognition for medial /s/ and final /s/ did not differ significantly from the baseline; that is, we found no reliable difference in how often participants falsely recognized critical items such as sopa, busto, and remos.
A separate analysis that included only the thirty-eight participants of Mexican origin yielded an identical pattern of results. A separate analysis that included only participants who could speak or easily understand [s] ~ [h] varieties also yielded an identical pattern of results. As in the main model, rates of recognition for medial /s/ and final /s/ did not differ significantly from the baseline.

Summary and discussion for Experiment 1
Experiment 1 tested the exemplar-based hypothesis that spoken words restricted to [s] variants should activate representations for words like sopa more strongly than representations for words like busto or remos. Results support the hypothesis only insofar as the data exhibit a trend in the appropriate direction for the recall task. These data show that participants falsely recalled words with initial /s/, such as sopa, approximately 30% of the time. But they falsely recalled words with medial and final /s/, such as busto and remos, approximately 23% or 24% of the time. These differences are in the predicted direction, but our analysis indicated they were not significant. Also, the data exhibited no trends in the appropriate direction, significant or otherwise, for the recognition task. While the lack of significance could mean that exemplar theory makes a fundamentally incorrect prediction, it could also be due to limitations of the current study. In particular, while the majority of participants (31 out of 49) indicated that they could understand varieties of Spanish in which non-initial [s] varies with [h], they did not speak such a variety themselves. Previous research has shown that participants can rapidly adapt to new dialects during listening tasks (e.g., Dahan, Drucker & Scarborough 2008; Sumner & Samuel 2009), but the premise of the current study was different in the sense that we assumed a set of stored exemplars affected by a lifetime of listening experience.
It is possible that, in their lifetimes, our participants simply had not heard enough exemplars containing [h] to produce significant representational differences between words like sopa versus busto and remos. Schmidt (2015)

Experiment 2: Speech input with [s] and [h] variants
Like Experiment 1, Experiment 2 compared false recall and recognition rates for critical items with initial, medial, or final /s/. Unlike Experiment 1, neighbors for the critical items with medial or final /s/ were pronounced sometimes with [s], and sometimes with [h]. The prediction is that speech input with [s] and [h] variants should activate varying words (those with medial or final /s/, such as busto and remos) to a greater degree than speech input with the [s] variant alone can activate non-varying words (those with initial /s/, such as sopa).

Method
The stimuli and procedure were identical to Experiment 1, as was the speaker who produced the recording. Note that it is possible that our recording procedure created a somewhat artificial listening context for the participants of Experiment 2, who heard the same speaker alternate between words with [s] and words with [h] in a balanced fashion, a situation that is unlikely to occur in everyday circumstances. While using multiple speakers (e.g., one who pronounced [s] variants only, and another who pronounced [h] variants only) might have created a more natural-seeming context, it would have confounded the interpretation of the results, because participants would conceivably respond in different ways to the different voices; indeed, the research on voice information cited in Section 1 strongly suggests that such a scenario would occur. Because we were interested in isolating the effects of variation on word activation, it was important not to confound these with the effects of voices, and so we used a single speaker to present the stimuli.

Participants
Experiment 2 took place at the same university campus as Experiment 1. Thus, while it was subject to the same constraints on participant recruitment described in Section 2.4.1, we continued to operate under the assumption that the United States represents a contact situation in which many, and possibly most, Spanish speakers have substantial exposure to [s] ~ [h] variation.
Forty-seven native speakers of Spanish, none of whom participated in Experiment 1, participated in the study. They came from diverse backgrounds, although as we discuss in Section 3.4, their results patterned uniformly. Eleven participants were from Spain, nine from Colombia, and eight from Mexico (seven who grew up there, and one who grew up in the United States with parents of Mexican origin). In addition, five participants were from Puerto Rico (four who grew up there, and one who grew up in the United States with parents of Puerto Rican origin), three from the Dominican Republic, three from Venezuela, two from Guatemala, one from Peru, and one from Honduras. Four participants did not state their country of origin.
The participants completed the same questionnaire used in Experiment 1. On it, twenty-nine of the participants indicated that they could easily understand varieties of Spanish with [s] ~ [h] variation. Eleven participants spoke such a variety themselves, while seven indicated they had difficulty understanding such varieties. Twenty-nine participants were female, and eighteen were male. Their ages ranged from 18 to 63, with a mean of 28 (SD = 8.93). They were all living in the United States at the time of the study, and spoke English in addition to Spanish. The experiment took approximately forty-five minutes of each participant's time, and they received cash compensation.

Recall results for Experiment 2
The recall task yielded a total of 4,349 responses. Thus, on average, listeners responded with 6.17 items per list (=4,349/(15 lists per participant * 47 participants)). Once the fillers were removed, there were 2,506 target responses. Duplicate responses were removed (127 in the initial /s/ condition, 129 in the medial /s/ condition, and 101 in the final /s/ condition), and so were typos that did not produce a Spanish word (27 in initial /s/ condition, 26 in medial /s/ condition, 15 in final /s/ condition), leaving a total of 2,081 responses for the final analysis. We coded responses and calculated proportions in the same manner as described for Experiment 1. The mean proportion of veridical, intrusion, and critical item responses given across the three conditions is shown in Table 9.
To analyze the results, we followed the same procedure described for the recall task in Experiment 1; namely, a mixed logit model with predictor variables of response type (Veridical vs. Intrusion vs. Critical item) and condition (Initial /s/ vs. Medial /s/ vs. Final /s/), and random intercepts for the critical item's FW-NPR and for the mean FW-NPR of corresponding neighbors on the list. Models that also included random intercepts for participant failed to converge. As before, we used treatment coding with "Initial /s/" and "Critical Item" as baselines. The results of this model are in Table 10.
The model shows that participants exhibited a significant false memory effect, because the odds of a response decreased in the Intrusion condition compared to the Critical Item baseline, by a factor of approximately 0.56 ($= e^{-0.58}$). Also, as expected, the odds of a response increased in the Veridical condition compared to the Critical Item baseline, by a factor of approximately 2.51 ($= e^{0.92}$). The model indicates no significant difference in false memory rates for critical items with medial /s/ compared to the baseline, which suggests that participants were equally likely to falsely recall critical items such as sopa and busto. The model does indicate a significant difference in false memory rates for critical items with final /s/ compared to the baseline, but in a direction contrary to our predictions. The odds of a response decreased in the Final /s/ condition compared to the Initial /s/ baseline, by a factor of approximately 0.54 ($= e^{-0.62}$), which shows that participants were less likely to falsely recall a critical item such as remos compared to sopa. The model also indicated one interaction. Compared to the baseline, the odds of a response increased in the Final /s/ Intrusion condition by a factor of approximately 7.61 ($= e^{2.03}$).

Recognition results for Experiment 2
The recognition task yielded 4,324 responses for analysis (=92 items * 47 participants). The mean proportion of "yes" responses given to veridical words, intrusions, and critical items is shown in Table 11.
As in Experiment 1, we analyzed the results using a mixed logit model with predictor variables of item type (Veridical vs. Intrusion vs. Critical item) and condition (Initial /s/ vs. Medial /s/ vs. Final /s/), with a random intercept for critical item FW-NPR. Models with participant as a random factor failed to converge. We used treatment coding with "Initial /s/" and "Critical Item" as baselines. The results of this model are in Table 12. The model shows that participants exhibited a significant false memory effect, because the odds of a response decreased in the Intrusion condition compared to the Critical Item baseline, by a factor of approximately 0.33 ($= e^{-1.12}$). Also, as expected, the odds of a response increased in the Veridical condition compared to the Critical Item baseline, by a factor of approximately 2.89 ($= e^{1.06}$).
Consistent with our predictions, the model indicates a significant increase in false memory rates for critical items with medial /s/ compared to the baseline. The odds of a response increased in the Medial /s/ condition compared to the baseline, by a factor of approximately 2.32 ($= e^{0.84}$), which shows that participants were more likely to falsely recognize critical items such as busto compared to sopa. The model does not, however, reveal such a finding for critical items with final /s/. The model also indicates one interaction. Compared to the baseline, the odds of a response decreased in the Medial /s/ Veridical condition by a factor of approximately 0.36 ($= e^{-1.03}$).

Summary and discussion for Experiment 2
Experiment 2 tested the hypothesis that speech input containing [s] and [h] will activate representations for words like busto and remos more than input containing only [s] will activate representations for words like sopa. Results support the hypothesis, but only for words with medial /s/ in the recognition task. Thus, our key finding is that rates of false recognition were significantly greater for critical items like busto compared to the baseline sopa.
Although a potential caveat comes from the diversity of our participant group, which we noted in Section 3.2, our result was consistent across different sub-groups of participants. This is shown in Table 13, which displays rates of false recognition for critical items (veridical words and intrusions are omitted), broken down by participants' level of [...]
Table 13: Proportion of "yes" responses to critical items in the recognition task for Experiment 2. "Speak" refers to participants who speak a variety of Spanish with variation between [s] and [h], "Understand" refers to those who don't speak such a variety but understand it with ease, and "Difficulty" refers to those who don't speak such a variety and have difficulty understanding it.
Additional support comes from Table 14, which displays rates of false recognition broken down by participants' countries of origin. With just a couple of exceptions, regardless of where participants originated, the pattern is the same: again, higher rates of false recognition for critical items with medial /s/, such as busto, compared to those with initial /s/, such as sopa.

The upshot is that a diverse participant group yielded a relatively consistent pattern of results, strengthening our conclusion that speech input containing [s] and [h] will activate representations for words like busto more than input containing [s] will activate representations for words like sopa.
This conclusion raises two further questions. First, why do we see the effect only in the medial /s/ condition, and not final /s/? Second, why do we see the effect only in recognition, not recall? Both questions have potentially straightforward answers, which we turn to in the following sections.

Medial versus final conditions
The different patterns for the medial versus final /s/ conditions probably have their origins in perceptual factors; specifically, in the inaccurate perception of [h] variants. We can see this by examining the recall data in Table 9, which shows that, for the final /s/ condition, rates of veridical recall (0.30) were lower than the intrusion rate (0.46). In other words, participants were more likely to write down a random intrusion than to accurately remember a word that they had actually heard on the list. This is a highly aberrant pattern which suggests that our participants did not accurately perceive some of the list items to begin with; in particular, they may not have perceived intended [h] in final position.
To determine if this may have been the case, we counted the number of times that participants gave a recall response which omitted the expected orthographic s corresponding to spoken [h]. In the medial /s/ condition, there were 22 such instances (e.g., participants provided the response gata instead of the expected gasta 'he/she/you spends') and in the final /s/ condition, there were 212 such instances (e.g., they provided the response rezo instead of the expected rezos 'prayers').
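A count of this kind could be automated by checking whether a response is identical to a list word except for one missing s. The sketch below is purely illustrative: the text does not say how the count was actually carried out, and the function name is hypothetical.

```python
def omits_expected_s(response: str, target: str) -> bool:
    """True if deleting exactly one 's' from the target yields the response."""
    if len(response) != len(target) - 1:
        return False
    return any(
        target[i] == "s" and target[:i] + target[i + 1:] == response
        for i in range(len(target))
    )

# Examples taken from the text: medial (gasta) and final (rezos, retos) positions.
for response, target in [("gata", "gasta"), ("rezo", "rezos"), ("reto", "retos")]:
    print(response, target, omits_expected_s(response, target))
```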
It is unlikely that the large number of misperceptions in final position is due to excessive reduction (i.e., to Ø) in our stimuli, because our speaker consistently produced audible correlates of [h]. Figure 2 shows a typical pronunciation, where frication noise associated with [h] is clearly present.
Despite the presence of [h] in the acoustic signal, it is nevertheless not entirely surprising that our participants sometimes misperceived it. Previous experiments have examined [...]
In our experiment, if participants did not perceive list words as intended, this would have an obvious impact on false memories, which, as described in Section 1, arise from converging activation produced by phonological neighbors. That is, a false memory for remos depends crucially upon the accurate perception and activation of words like demos, ramos, retos, etc. To further pursue this idea, we performed an alternative coding of the recall data, in which the omitted-s responses such as gata and reto were counted as "Veridical" instead of "Intrusion". The resulting proportions are displayed in Table 15.
Importantly, the aberrant pattern is no longer evident: in the final condition, rates of veridical recall (0.45) are now higher than the intrusion rate (0.16), as expected. This strongly suggests that the low rate of false memories for final /s/ words (in both recall and recognition tasks for Experiment 2) does not arise from a fundamental difference in lexical representations between words like busto and remos, but rather from difficulties in perceiving the list words. Future work could address this issue, potentially by using cross-splicing to ensure robust and uniform realization of [h] in word-final positions.

Recognition vs. Recall
We turn now to the second question, namely, why do we see the predicted pattern of higher false memories for busto only in recognition, not recall? One possible answer is a test effect: participants completed the recall task first, which may have affected their subsequent performance on the recognition task. Indeed, many previous studies have reported significantly higher rates of false recognition for critical items when a recognition task occurs immediately after recall, compared to when it occurs immediately after an unrelated task such as math problems (summarized in Gallo 2006: 148-151). In the current study, then, the presence of a prior recall task may have increased false recognition rates somewhat, such that differences between conditions which were not detectable in recall became observable in recognition. Yet this answer is not completely satisfactory, particularly in light of the many previous studies which report patterns of false recall and recognition that are highly similar to one another (see Gallo 2006: 23-30). Most tellingly, Roediger and colleagues ran a regression analysis on data collected for 55 DRM lists of semantic associates and concluded that "[…] within the limits of this study, the factors responsible for false recognition across lists seem to be the same as those producing false recall" (Roediger et al. 2001: 392). This is clearly not the case for Experiment 2 of the current study, and so future work will need to more fully explore the differences between recall versus recognition tasks, and link these differences to varying versus non-varying lists (one possible starting point is Roediger et al. 2004). Successful recall, for example, takes place in the absence of any cue, and therefore requires a process of explicit recollection. Successful recognition, on the other hand, takes place in the presence of an explicit cue, and may therefore require only a feeling of implicit familiarity (see Yonelinas 2002).
What characteristics of the current stimuli tap into this distinction, and are these characteristics unique to lists of words that contain phonological variants? Answering this question will not only reveal more about the difference in lexical representations between varying words like busto versus nonvarying words like sopa, but also help us to reach bigger-picture conclusions about when and why people form false memories.

Discussion
The current study investigated two predictions of exemplar theory, which is unique among models of variant word recognition because it proposes that listeners store multiple instances of heard words, rather than a single abstract representation. As outlined in Section 1, a great deal of previous work in exemplar theory has focused on voices. Yet the predictions that the theory makes for words with multiple variants are particularly interesting in their own right, because such words will, by definition, be stored differently in the minds of listeners than words with only a single variant. We focused specifically on [s] ~ [h] variation in Spanish, which is an important pattern because it is influenced by a variety of well-documented factors that have already justified an exemplar approach to production. One goal of the current study was to bolster this approach for perception as well.
Our results provide partial support for our predictions. In Experiment 1, where all heard words were pronounced with [s], the predicted difference between varying and non-varying words was evident only in a non-significant trend in the recall task. That is, although varying words like busto and remos showed somewhat lower rates of false recall (reflecting lower levels of lexical activation) compared to non-varying words like sopa, this difference was not significant. In Experiment 2, where varying words were pronounced with either [s] or [h] but non-varying words were pronounced only with [s], the predicted difference was evident in a significant effect in the recognition task. That is, varying words like busto showed significantly higher rates of false recognition (reflecting higher levels of lexical activation) compared to non-varying words like sopa. As discussed in Section 3.3.3, the lack of effect for varying words like remos is probably due to unrelated perceptual issues.
An important issue for future research is to understand why we obtained significant results for Experiment 2, but not Experiment 1. One possibility is the participant groups,  [h]to within a single list), while Experiment 1 did not, and it is possible that this difference affected participants' responding strategies in the memory tasks. In general, the trend in the predicted direction in Experiment 1, at least for recall, suggests that a significant result could emerge in future work with a larger number of participants.

Predictions based on varieties of Spanish and word position
The predictions that we outlined in Section 1 relied upon a somewhat crude distinction between non-varying and varying words. We worked with the assumption that, for "lowlands" varieties of Spanish, words like sopa are realized with [s] "all of the time", whereas words like busto and remos are realized with [s] "some of the time." Although this crude distinction was sufficient for testing our basic hypothesis, and also appropriate for experiments with participants living in a Spanish-variety contact situation in the United States, the literature on production of Spanish [s] ~ [h] provides sufficient data to make more refined hypotheses that could, in the future, delve more deeply into the predictions of exemplar theory for perception.
One such distinction arises from considering the varieties of Spanish spoken in different parts of the world. As we mentioned in Section 1, even among those varieties that clearly exhibit  (Lipski 1994; E. L. Brown & Cacoullos 2003).
Another distinction arises from considering the differences between words like busto, where /s/ is word-medial, versus remos, where /s/ is word-final. As we mentioned in Section 1, even within a given variety of Spanish that exhibits [s] ~ [h] variation, there are sometimes large differences in rates of [h] realization between these two types of words, as shown in Table 17.
Again, with this type of data, exemplar theory makes some clear predictions that distinguish it from abstractionist theories. For listeners from Cali, for example, the difference The possibility of such investigations highlights the richness of exemplar theory and the opportunities that it affords for making connections between production and perception patterns.

Exemplar theories versus other theories of variant word recognition
A fundamental tenet of exemplar theory is that surface phonetic variation does not reduce to a single lexical representation, but is instead stored in the lexicon as multiple episodes, which cluster together according to their acoustic similarity. As such, the theory predicts fundamental differences between words that have multiple surface variants versus those that have only one, and can explain the differences between sopa and busto that we reported in Experiment 2. It is important to see that other theories of variant recognition cannot account for such data. In inference models, listeners essentially "undo" phonological rules that produce surface forms such as [buhto], and access a representation containing an underlying form such as /busto/ (Gaskell & Marslen-Wilson 1996, 1998; Pitt 2009). In other words, upon hearing [buhto], the listener undoes the rule stating that /s/ becomes [h] at the end of syllables. With speech input containing [s], then, this model makes no prediction for a difference in activation levels among words like sopa, busto, and remos, because the word is already realized with its underlying form and there is no rule to undo. With speech input containing [h], the model does make a prediction, namely that activation levels for words like busto and remos should be lower than for words like sopa, because the presence of [h] requires the undoing of a rule and therefore exacts a cost in processing. But as we have seen, the data from Experiment 2 indicate the opposite finding.
In underspecification models, listeners map speech inputs onto representations that are not fully specified for all features (Lahiri & Marslen-Wilson 1991; Lahiri & Reetz 2010). Thus, a surface input [busto] or [buhto] would activate an underlying form /buSto/, where /S/ lacks a specification for place of articulation and is therefore compatible with either [s] or [h]. Underspecification models predict no difference between words like sopa, busto and remos, whether they are activated by input with [s] or [h]. This is because in both cases, the speech input maps directly to /S/. The data from Experiment 2 do not pattern in this manner, and underspecification models would be hard-pressed to explain why activation levels for busto should exceed those of sopa.
"Hybrid" models, such as those suggested by Connine and colleagues, are somewhat different because they adopt certain elements of exemplar theory (Ranbom & Connine 2007; Connine et al. 2008; Pinnow & Connine 2014). The basic idea is that all variant forms have their own representations, which are abstract but crucially accompanied by frequency information. Such a model could make some predictions in a manner similar to exemplar theory, because a word like sopa is "frequently" realized with the [s] variant, while words like busto and remos are "infrequently" realized with the [s] variant. However, because the hybrid model rejects the idea that all individual exemplars are stored in memory, it has no mechanism to create blended representations and therefore no way to account for the generic pattern for which we found evidence in Experiment 2.
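The contrast among these accounts can be laid out schematically. The short Python sketch below is only an illustrative summary, not an analysis from the study. It encodes the direction of the activation difference that each account is described as predicting for varying words (busto) relative to non-varying words (sopa), and checks which accounts match a given outcome. The condition labels `s_only` and `s_or_h`, the numeric coding (-1, 0, +1), and the function name `models_matching` are all our own assumptions. The exemplar entries reflect our reading of the predicted directions reported for Experiments 1 and 2, and the mapping of "input containing [h]" onto the Experiment 2 condition is likewise an assumption. Hybrid models are omitted because the discussion above does not attribute a directional prediction to them.

```python
# Illustrative coding (an assumption, not part of the study's analysis):
# -1 = varying words less activated than non-varying words,
#  0 = no difference, +1 = varying words more activated.
# "s_only": all words heard with [s] (as in Experiment 1).
# "s_or_h": varying words heard with [s] or [h] (as in Experiment 2).
PREDICTIONS = {
    "inference": {"s_only": 0, "s_or_h": -1},
    "underspecification": {"s_only": 0, "s_or_h": 0},
    "exemplar": {"s_only": -1, "s_or_h": +1},
}


def models_matching(condition, observed_direction):
    """Return the accounts whose coded prediction equals the observed direction."""
    return sorted(
        name
        for name, by_condition in PREDICTIONS.items()
        if by_condition[condition] == observed_direction
    )


if __name__ == "__main__":
    # Experiment 2 recognition: busto showed higher false recognition than sopa.
    print(models_matching("s_or_h", +1))
```

Only the direction of each difference is coded here; the sketch says nothing about effect sizes or statistical significance, which is why the non-significant trend in Experiment 1 is not entered as an observed outcome.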

Phonological variation and the DRM false memory paradigm
Most word recognition studies activate lexical representations directly, by presenting listeners with matching information in the speech input: that is, they activate a word like sopa by presenting [sopa]. The current study obviously differs from this approach, because we activated representations indirectly, by presenting listeners with phonological neighbors: that is, we activated sopa by presenting [popa], [sepa], [soxa], etc. Our use of the DRM false memory paradigm has several implications, which we discuss here.
To begin, we did not ultimately evaluate the idea that speech input with [s] directly activates sopa more than busto and remos, which would have required us to present inputs like [sopa], [busto], and [remos] and measure the resulting activation. As discussed in Section 1, the disproportionate role that word-initial segments (here, [s]) play in lexical activation ruled out such a paradigm. Instead, we employed indirect activation to investigate a more general idea, namely that comparable information in the speech input (here, in the form of converging phonological neighbors) activates varying words less strongly than it activates non-varying words. A resulting limitation is that we cannot make complete comparisons between our findings and those that use direct activation methods. Another limitation is that we must assume a transitive relationship whereby weaker (or stronger) activation for a target also produces weaker (or stronger) activation for its neighbors, although several previous studies provide evidence for such a relationship (Underwood 1965; Hall & Kozloff 1970; Arndt & Hirshman 1998; Benjamin 2001; Zeelenberg, Plomp, & Raaijmakers 2003; Kawasaki & Yama 2006). Furthermore, the DRM false memory paradigm differs from more common word recognition paradigms because it includes a monitoring component. That is, even under the assumption that lexical activation for unheard critical items does occur, the recall and recognition tasks still require participants to decide whether that activation originated from an event that actually happened (i.e., from a word that they actually heard) or not. In other words, they must "monitor" the source of their memories before providing a response (M. K. Johnson & Raye 1981; M. K. Johnson, Hashtroudi & Lindsay 1993; Roediger & McDermott 1995; M. K. Johnson 2006).
Previous work has demonstrated that manipulating monitoring conditions can alter false memory rates (Israel & Schacter 1997; Schacter, Israel & Racine 1999; Dodson & Schacter 2001), so we must remain open to the idea that differences in monitoring, rather than in activation, could account for our results. Given that no obvious monitoring manipulation occurred in the current study, however, this scenario seems unlikely.
Finally, although we have interpreted our results in terms of activation levels, they also have more general implications for false remembering. It is not unusual for people to experience a memory for an event that did not occur, and false memories have been documented in a range of different contexts (e.g., Loftus 2005; Loftus & Bernstein 2005). False memories triggered by the DRM paradigm are particularly robust and highly replicable (Gallo 2006, 2010); they occur, for example, even when the list of neighbors is small (Robinson & Roediger 1997), or when (orthographic) neighbors are presented for very brief durations (Seamon, Luo & Gallo 1998). Given this, the results of the current study are notable for identifying an instance in which the false memory effect is not uniform. If these results eventually generalize, we would have the basis to argue that varying versus non-varying words are crucially stratified according to their relative susceptibility to memory distortions.