Rapid generalization in phonotactic learning

Speakers judge novel strings to be better potential words of their language if those strings consist of sound sequences that are attested in the language. These intuitions are often generalized to new sequences that share some properties with attested ones: Participants exposed to an artificial language where all words start with the voiced stops [b] and [d] will prefer words that start with other voiced stops (e.g., [g]) to words that start with vowels or nasals. The current study tracks the evolution of generalization across sounds during the early stages of artificial language learning. In Experiments 1 and 2, participants received varying amounts of exposure to an artificial language. Learners rapidly generalized to new sounds: In fact, following short exposure to the language, attested patterns were not distinguished from unattested patterns that were similar in their phonological properties to the attested ones. Following additional exposure, participants showed an increasing preference for attested sounds, alongside sustained generalization to unattested ones. Finally, Experiment 3 tested whether participants can rapidly generalize to new sounds based on a single type of sound. We discuss the implications of our results for computational models of phonotactic learning.


Introduction
Natural languages typically place restrictions on the ways in which sounds can combine to form words.
The consonant [h], for example, can occur in the onset of an English syllable, as in half [hæf], but not in its coda: English does not have words like *fah [fæh] (McMahon, 2002). English speakers do not typically consider this gap to be accidental; they judge words that end with a [h] as unlikely to become words of the language. The set of all such restrictions is referred to as the phonotactics of the language.
The distinction between phonotactically legal and illegal words is reflected in a variety of implicit tasks, in both adults and infants (Jusczyk, Friederici, Wessels, Svenkerud, & Jusczyk, 1993; McQueen, 1998). Sounds with similar articulatory or perceptual properties tend to have similar phonotactic distributions. In German, for example, voiced stops (e.g., [b] or [g]) are not allowed at the end of a syllable: [bal] is a valid German word, but *[lab] is not (Jessen & Ringen, 2002). Speakers often use such class-wide phonotactic patterns to generalize from structures attested in their lexicon to new sounds and sequences.
For example, the onsets [sr] and [mb] are both unattested in English, but [sr] is similar to the attested strident-liquid onsets [sl] and [ʃr], while there are no attested sonorant-stop sequences similar to [mb] (Albright, 2009). English speakers judge srip to be a better potential word than mbip (Scholes, 1966; Daland et al., 2011); this suggests that speakers judge novel structures based on similarity to existing structures.
Phonotactic learning studies with adults usually provide participants with considerable exposure to the language. Likewise, models of phonotactic learning focus on the end stage of the learning process, after a large amount of data from the language has been encountered. Conversely, there is little empirical data and modeling work bearing on the time course of phonotactic learning. How much evidence do learners need to begin generalizing to new sounds? Does generalization to novel sounds that have a particular phonological feature (e.g., voiced stops) require multiple attested sounds that have that feature?
In what way does the likelihood of generalization diminish as participants are exposed to more examples of the set of sound sequences that exist in the language? In the rest of this introduction, we describe how the answers to these questions could inform models of phonotactics (Section 1.1) and outline the artificial language learning experiments presented in the rest of the paper, which begin to address these questions (Section 1.2).

The time course of generalization in models of phonotactics
We first review two prominent views of the time course of generalization to new sounds in probabilistic models of phonotactics (see also Cristia & Peperkamp, 2012; Kapatsinski, 2014). In practice, both families of models are quite flexible, and may be modified to accommodate a wide range of results; while we do not intend this paper as a final adjudication between these two types of models, it is useful to survey these classes of models as a framework with which to evaluate our experimental results.
One family is based on minimal generalization learning (Albright & Hayes, 2003; Albright, 2009; Adriaans & Kager, 2010). In these models, learners generalize to the smallest phonological class that contains the sounds supporting the generalization. In particular, this means that a single sound does not lead to any generalization. Once learners have noticed the commonalities among multiple sounds they have acquired, they form a generalization over the smallest phonological class that contains these sounds; this class can include some unattested sounds. For example, when acquiring the phonotactics of English, learners may first learn that both [b] and [g] are valid onsets for English syllables before they can generalize to other voiced stops (e.g., [d]). This generalization will be restricted to the minimal class that contains the attested onsets (i.e., voiced stops), at least until a voiceless stop onset is encountered.
This assumption, which we refer to as the specific-to-general assumption, is in line with the finding from the artificial language learning literature that infants require three different exposure types to generalize to a novel item (Gerken & Bollt, 2008).
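The minimal generalization idea can be sketched in a few lines of Python. This is an illustration only: the feature table is a deliberately tiny, hypothetical inventory of six stops, and the function name `minimal_class` is ours rather than part of any of the cited models. The sketch shows that the smallest class containing a single sound is that sound alone, whereas two voiced stops already pick out the class of all voiced stops.

```python
# Toy feature table (an illustrative assumption, not the inventory of any cited model).
FEATURES = {
    "b": {"voice": "+", "place": "labial"},
    "d": {"voice": "+", "place": "coronal"},
    "g": {"voice": "+", "place": "dorsal"},
    "p": {"voice": "-", "place": "labial"},
    "t": {"voice": "-", "place": "coronal"},
    "k": {"voice": "-", "place": "dorsal"},
}


def minimal_class(attested):
    """Return the smallest feature-defined class containing all attested sounds."""
    attested = list(attested)
    # Keep only the feature values shared by every attested sound.
    shared = dict(FEATURES[attested[0]])
    for sound in attested[1:]:
        shared = {f: v for f, v in shared.items() if FEATURES[sound].get(f) == v}
    # The class is every sound in the inventory that matches the shared values.
    return {
        s for s, feats in FEATURES.items()
        if all(feats[f] == v for f, v in shared.items())
    }


print(sorted(minimal_class(["b"])))       # a single sound licenses only itself
print(sorted(minimal_class(["b", "g"])))  # two voiced stops license all voiced stops
print(sorted(minimal_class(["b", "k"])))  # no shared feature value: the whole toy inventory
```

Under these assumptions, generalization to [d] becomes possible only once a second voiced stop has been attested, which is the specific-to-general behavior described above.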
Other models consider representations at multiple levels of generality from the earliest stages of learning, without waiting for multiple sounds to support a particular dimension of generalization (Hayes & Wilson, 2008; Moreton, Pater, & Pertsova, 2015; Linzen & O'Donnell, 2015). In maximum entropy models such as the Hayes and Wilson (2008) model or GMECSS (Moreton et al., 2015), for example, the well-formedness of a sound is derived from a linear combination of the weights associated with each of the phonological classes that the sound belongs to: the well-formedness of a [b] is determined by the weight for [b] and the weight for the class of voiced stops, among the various other classes that [b] belongs to. Learning the phonotactics of the language consists in determining the set of weights that is most consistent with the statistical distribution of sounds in the language. Since both the sound-specific and the class-wide weight contribute to the well-formedness of a [b] token, exposure to this token in learning will cause both weights to be increased. Even if the only token the learner has been exposed to is a [b], then, the learner will still be more likely to judge a novel voiced stop such as [g] as acceptable than a voiceless one such as [k], because attested [b] tokens are taken as support for both the specific sound [b] and for the wider class of voiced stops. This contrasts with specific-to-general models that do not predict any generalization from an individual sound.
The predictions that the specific-to-general view makes are quite strong. If the specific-to-general assumption is combined with the assumption that the induction of sound-specific patterns presupposes a particular number of tokens of that sound (Adriaans & Kager, 2010), learning a pattern over a class of sounds requires at least as much exposure to the language as learning a pattern over a single sound; we test this prediction in Experiments 1 and 2a. Furthermore, a minimal generalization learner that has only been exposed to two words in a language, both of which start with a [k], will conclude that all words in the language start with a [k] until shown evidence to the contrary, and will not generalize at all to other sounds. We test this prediction in Experiment 3.
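The following Python sketch illustrates how a single [b] token can raise the score of [g] relative to [k] in a maximum entropy setting. It is a simplified toy under stated assumptions, not the Hayes and Wilson (2008) learner or GMECSS: the six-sound inventory, the two feature types (one per sound, one per voicing class), the reward-style weights, the zero initialization and the learning rate are all ours.

```python
import math

SOUNDS = ["b", "d", "g", "p", "t", "k"]  # toy inventory (assumption)
VOICED = {"b", "d", "g"}


def features(sound):
    """Each sound activates its own feature plus one class-level feature."""
    cls = "class:voiced_stop" if sound in VOICED else "class:voiceless_stop"
    return [f"sound:{sound}", cls]


def score(sound, weights):
    """Well-formedness: sum of the weights of the features the sound activates."""
    return sum(weights.get(f, 0.0) for f in features(sound))


def probabilities(weights):
    """Maximum entropy distribution over the toy inventory."""
    exp_scores = {s: math.exp(score(s, weights)) for s in SOUNDS}
    z = sum(exp_scores.values())
    return {s: v / z for s, v in exp_scores.items()}


def gradient_step(weights, observed, rate=1.0):
    """One gradient ascent step on the log-likelihood of one observed token.

    For each feature, the gradient is its observed count minus its expected
    count under the current distribution.
    """
    probs = probabilities(weights)
    new = dict(weights)
    for s in SOUNDS:
        for f in features(s):
            observed_count = 1.0 if s == observed else 0.0
            new[f] = new.get(f, 0.0) + rate * (observed_count - probs[s])
    return new


# Start from all-zero weights and observe a single [b] token.
weights = gradient_step({}, observed="b")
for s in ["b", "g", "k"]:
    print(s, round(score(s, weights), 3))
```

Because the [b] token increases the weight of the voiced-stop class as well as the weight of [b] itself, the unattested [g] ends up with a higher score than [k] after a single observation, while [b] scores highest of the three.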

The subset problem and indirect negative evidence
The specific-to-general assumption is a natural solution to the subset problem (Dell, 1981). This problem has been argued to affect learners that can only use positive evidence from attested forms, as is the case for human learners: a learner of English is rarely told explicitly that a particular form (e.g., *mpepm) is phonotactically illegal. To illustrate the problem, suppose that the onsets that the learner has been exposed to are [b], [d] and [g]. This input is compatible with the following two grammars (among others): in Grammar 1, all words start with a voiced stop; in Grammar 2, words can start with any stop, either voiced or voiceless. The language generated by Grammar 1 is a subset of the language generated by Grammar 2. If at one point in the learning process the learner believes that Grammar 1 is correct, and later on encounters a word that starts with a voiceless stop (e.g., [k]), the learner can revise its decision and assume the less restrictive Grammar 2 instead. The reverse decision is prima facie impossible because of the absence of negative evidence: A learner that has chosen Grammar 2 would never receive evidence that the generalization was too wide. To avoid overly broad generalizations, then, learners have to be conservative: "Whenever there are two competing grammars generating languages of which one is a proper subset of the other, the learning strategy of the child is to select the less inclusive one" (Dell, 1981, p. 34). This strategy was later termed the Subset Principle (Berwick, 1985; Hale & Reiss, 2003).
The specific-to-general assumption clearly addresses the subset problem. Yet it is not the only solution to it. It is true that learners rarely receive direct evidence that certain sound combinations are impossible in their language; however, they often do receive indirect negative evidence in the form of frequency asymmetries. Suppose that the learner is exposed to a language in which words start only with [b] or [d] (a simplified version of Experiment 1). After encountering two words in the language, one that starts with [b] and one that starts with [d], the learner might conclude that the best characterization of the phonotactics of the language is that words can start with any voiced stop. Yet as the learner encounters additional words, all of which start with [b] or [d], the systematic absence of [g] onsets increasingly argues against the original generalization that any voiced stop can serve as an onset (Tenenbaum & Griffiths, 2001; Xu & Tenenbaum, 2007). This constitutes indirect negative evidence that may cause the learner to favor a narrower hypothesis, in this case that only [b] and [d] are legal onsets.
Most probabilistic models will exhibit this behavior (Hayes & Wilson, 2008; Linzen & O'Donnell, 2015; Moreton et al., 2015). In summary, then, unless additional assumptions are made, models based on indirect negative evidence predict that given frequency asymmetries suggesting that a gap in a generalization is systematic, the learner will conclude that the sound sequence missing from the input is essentially ungrammatical, and will not generalize to it. Models such as the minimal generalization learner, which are not sensitive to such frequency asymmetries, predict sustained generalization (for simulations, see Linzen & Gallagher, 2014; Linzen & O'Donnell, 2015). We contrast these predictions in Experiments 1 and 2a.
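The effect of indirect negative evidence can be illustrated with a two-hypothesis Bayesian sketch in Python. This is a toy under explicit assumptions, not one of the cited models: the learner entertains only a narrow hypothesis ([b] and [d] are legal onsets) and a broad one (any voiced stop is legal), tokens are assumed to be sampled uniformly from the legal onsets, and the prior favoring the broad hypothesis (0.8) is an arbitrary value chosen for illustration.

```python
def posterior_broad(n_tokens, prior_broad=0.8):
    """Posterior probability of the broad hypothesis after n consistent tokens.

    Narrow hypothesis: only [b] and [d] are legal onsets (2 sounds).
    Broad hypothesis: any voiced stop, [b], [d] or [g], is legal (3 sounds).
    Each token is assumed to be sampled uniformly from the legal onsets,
    so a hypothesis with k legal onsets assigns probability (1/k)**n
    to n tokens that are all consistent with it.
    """
    like_narrow = (1 / 2) ** n_tokens
    like_broad = (1 / 3) ** n_tokens
    joint_broad = prior_broad * like_broad
    joint_narrow = (1 - prior_broad) * like_narrow
    return joint_broad / (joint_broad + joint_narrow)


# All observed onsets are [b] or [d]; [g] never occurs.
for n in [2, 4, 8, 16]:
    print(n, round(posterior_broad(n), 3))
```

With these settings the broad hypothesis is favored after two tokens but loses support steadily as more [g]-less tokens accumulate, which is the pattern of diminishing generalization that such models predict in the absence of additional assumptions.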

Overview of the experiments
This paper reports the results of four artificial language learning experiments. In Experiment 1, participants were taught a language in which all word onsets had the same voicing (all were voiced or all were voiceless). Participants were divided into four groups, each of which received a different amount of exposure to the language. After the exposure phase, participants judged novel test words for acceptability.
These test words started with one of three types of onsets: onsets that the participants had encountered in the exposure phase (attested); new onsets that had the same voicing as the exposure onsets (conforming); and new onsets that had the opposite voicing from the exposure onsets (nonconforming). To anticipate the results, participants showed evidence of distinguishing voiced from voiceless onsets after very little exposure, but required more exposure to distinguish the onsets they were exposed to from the unattested but conforming onsets.
The language used in Experiment 1 had a categorical, phonetically based pattern. Experiment 2a tested the generality of the findings of Experiment 1 by teaching participants a language with an abstract phonotactic pattern that was not tied to a phonetic feature (specifically, identity between two consonants), using a similar paradigm. Further expanding on Experiment 1, the regularity in Experiment 2a was probabilistic rather than categorical. The results were qualitatively similar to those of Experiment 1: participants showed early distinction between test words with identical consonants and test words with non-identical consonants, followed by a gradually increasing preference for the exposure consonants.
In Experiment 2b, participants were taught a control language whose goal was to verify that the results of Experiment 2a were indeed due to learning rather than a pre-existing preference for consonant repetition. Finally, Experiment 3 showed that learners can generalize a phonotactic regularity to new sounds based on just a single instance of the regularity.

Experiment 1: A natural-class based generalization
The artificial language used in this experiment had a categorical natural-class based phonotactic regularity: all word onsets had the same voicing (either all voiced or all voiceless; different versions of the language were presented to different participants). Following the exposure phase, participants provided acceptability judgments on words of three types:
1. Conforming attested onset (CONF-ATT): words whose onset appeared as the onset of one or more of the exposure words. Since the phonotactic pattern was categorical, all of these onsets conformed to it.
2. Conforming novel onset (CONF-UNATT): words whose onset did not appear as the onset of any of the exposure words, but had the same voicing as those onsets.
3. Nonconforming unattested onset (NONCONF-UNATT): words whose onset differed in voicing from the onsets of all of the exposure words.
All of the exposure words differed from each other, and all of the test words were distinct from the exposure words. This was the case even for CONF-ATT test words: in that condition the onset was shared with some of the exposure words, but the full word was novel. Each exposure set consisted of five words, one with each of the exposure onsets. Participants were divided into four groups; each group was given a different number of exposure sets (one, two, four or eight). For example, participants in the One Set group heard five exposure words, one with each of the exposure onsets, and participants in the Two Sets group heard ten exposure words, two with each of the exposure onsets. A detailed description is given in the Materials section below; see Table 1 for examples.

Materials and procedure
The onsets of all of the stimuli used in the experiment were drawn from the set of six voiced obstruents The list of exposure words was constructed in blocks, such that each consecutive block of five words had exactly one word starting with each of the five exposure onsets. Participants did not receive any indication of the structure of the lists. The order of onsets was pseudo-randomized within each block.
Likewise, the segments selected for the V1, C2 and V2 slots were pseudo-randomized in consecutive blocks such that each block contained all possible segments for the relevant slot. The test words were presented in two blocks of three tokens, one token for each of the onsets representing the CONF-ATT, CONF-UNATT and NONCONF-UNATT categories, in pseudo-random order (again without indication of the division into two blocks). The vowel pattern and medial consonants were randomized separately for each participant, such that the onsets were the only cue that systematically distinguished the test conditions.
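The blocked ordering described above can be sketched as follows. This is a minimal illustration rather than the script used to build the stimulus lists: the function name, the placeholder onset labels and the use of a plain shuffle within each block are assumptions, and any further constraints on the pseudo-randomization are not modeled.

```python
import random


def blocked_exposure_order(onsets, n_sets, seed=None):
    """Concatenate n_sets blocks, each a random permutation of the onsets."""
    rng = random.Random(seed)
    order = []
    for _ in range(n_sets):
        block = list(onsets)
        rng.shuffle(block)   # random order within the block
        order.extend(block)  # blocks are concatenated with no visible boundary
    return order


# Placeholder labels; the actual exposure onsets are not reproduced here.
onsets = ["onset1", "onset2", "onset3", "onset4", "onset5"]
order = blocked_exposure_order(onsets, n_sets=4, seed=0)
print(order)
```

Every consecutive run of five items in the resulting list contains each onset exactly once, so the number of tokens per onset always equals the number of exposure sets.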

Procedure
All experiments in this paper were conducted using Experigen, a JavaScript framework for running online experiments (Becker & Levine, 2010). Participants were recruited through Amazon Mechanical Turk. Results obtained using Mechanical Turk have been repeatedly shown to replicate established findings from the experimental behavioral research literature (Crump, McDonnell, & Gureckis, 2013).
Participants were paid $0.65 for completing an experiment. They were told that they needed to be native speakers of English to complete the experiment. They were asked for their native language in a short demographic survey at the end of the experiment; data from participants who reported a native language other than English were removed. Participants were limited to those with IP addresses within the United States. We rejected participants who performed multiple experiments or multiple versions of the same experiment, and assigned the task to new participants to reach the intended sample size.
The experiments were split into an exposure phase and a test phase. In both phases, the words were presented in isolation (i.e., not in a continuous stream). Participants were told that the exposure phase would be followed by a test phase during which they would be required to decide if new words sounded like they could belong to the language they were listening to (for a similar task, see Moreton, 2008, 2012; Reeder, Newport, & Aslin, 2013). During the test phase, the instructions for the task were repeated after every test word. Only two answers were possible: "yes" and "no".

Participants
Six participants completed each combination of the 12 lists and four exposure groups, for a total of 288 participants (72 participants per exposure group). Three participants were rejected because their reported native language was not English. We report data from the remaining 285 participants (116 women, 166 men, three unreported; median age: 30, age range: 18-68, one unreported).

Statistical analysis
Logistic mixed-effects models (LMEM) (Baayen, Davidson, & Bates, 2008; Jaeger, 2008) were fitted to the participants' responses ("yes" or "no") using version 1.1.11 of the lme4 package in R (Bates, Mächler, Bolker, & Walker, 2015). There were only three test conditions, NONCONF-UNATT, CONF-UNATT and CONF-ATT (there were no NONCONF-ATT test words since the phonotactic pattern held for all exposure words). As such, the design is not fully crossed, and we cannot estimate an interaction term between attestedness and conformity. We therefore treat the condition as a single three-level factor.
We fitted two types of models: full ANOVA models, which included all participants, and simple effect models, which only included participants in a given group (e.g., the Two Sets group). Fixed effects in the full models included group as a four-level factor and onset type as a three-level factor. All factors were coded using sum coding; the main effect of one factor can therefore be interpreted as its average effect across all levels of the other factors. The random effect structure for all models included a by-subject intercept and a by-subject slope for the effect of onset type, as well as a by-onset intercept.
The statistical significance of each term in the model was assessed by comparing the likelihood of the full model to the likelihood of a model that did not include the factor in question, but did include the random by-subject slope for that factor as well as higher order interactions wherever applicable (Levy, 2014).
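The model comparison described above amounts to a likelihood ratio test. The Python sketch below shows the arithmetic using only the standard library. The function names and the log-likelihood values are hypothetical (they are not taken from the fitted models), and the chi-square tail probability is computed from standard closed forms for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    # Standard recurrence linking the tail probability at df to that at df - 2.
    k = df - 2
    step = (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
    return chi2_sf(x, k) + step


def likelihood_ratio_test(loglik_full, loglik_reduced, df):
    """Return the test statistic and its p-value under the chi-square approximation."""
    statistic = 2 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, df)


# Hypothetical log-likelihoods for a full model and a model lacking one
# single-parameter term (df = 1).
stat, p = likelihood_ratio_test(loglik_full=-500.0, loglik_reduced=-504.05, df=1)
print(round(stat, 2), round(p, 4))
```

The degrees of freedom equal the number of parameters removed from the full model, so dropping a three-level factor coded with two contrasts corresponds to df = 2.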

Results
The mean proportion of test words that participants in each group judged as acceptable in each of the conditions is shown in Figure 1. Visual examination of the results suggests that participants in all exposure groups (except, perhaps, the Two Sets group) distinguished CONF from NONCONF onsets; conversely, only the Two, Four and Eight Sets groups show a distinction between the two categories of CONF onsets (ATT and UNATT).
An ANOVA in the full factorial model, with all four exposure groups and three conditions, found a significant main effect of condition (χ²(2) = 84.7, p < .001); the main effect of exposure group did not reach significance (χ²(3) = 7.4, p = .06), and neither did the interaction (χ²(6) = 11.9, p = .06). Our main interest, however, is in the pairwise comparisons within the three levels of the Onset Type factor, which we turn to next.

CONF-ATT vs. CONF-UNATT
An ANOVA including all exposure groups and only CONF-ATT and CONF-UNATT words found a significant effect of onset type (χ²(1) = 26.7, p < .001), such that CONF-ATT onsets were more likely to be endorsed than CONF-UNATT ones. The main effect of group was significant (χ²(3) = 9.2, p = .02), and so was the interaction between onset type and group (χ²(3) = 8.79, p = .03). Separate models fitted within each exposure group (simple effects) showed that this interaction was driven by the absence of a significant preference for CONF-ATT onsets in the One Set group (χ²(1) = .95, p = .33), compared with a significant preference for CONF-ATT onsets in the Two and Eight Sets groups and a nonsignificant preference in the Four Sets group (Two Sets: χ²(1) = 8.1, p = .004; Four Sets: χ²(1) = 3.15, p = .08; Eight Sets: χ²(1) = 5.7, p = .02).
How much support do the results of the One Set group provide for the "null" hypothesis, according to which there is no difference between attested and unattested CONF onsets? We calculated the appropriate Bayes factor using the Bayesian Information Criterion approximation (Kass & Raftery, 1995; Wagenmakers, 2007); the result was 10.6, corresponding to a posterior probability of approximately 91% for the null hypothesis assuming a uniform prior. This Bayes factor is characterized by Wagenmakers (2007) as indicating "positive" evidence for the null hypothesis.
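The conversion from a Bayes factor to a posterior probability can be made explicit with a short Python sketch. The function names are ours, and the BIC-based formula is the standard approximation described by Wagenmakers (2007); the BIC values of the fitted models are not reported here, so only the reported Bayes factor is used as input.

```python
import math


def bf01_from_bic(bic_null, bic_alternative):
    """BIC approximation to the Bayes factor in favor of the null model."""
    return math.exp((bic_alternative - bic_null) / 2)


def posterior_null(bf01, prior_null=0.5):
    """Posterior probability of the null hypothesis given a Bayes factor."""
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = bf01 * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Reported Bayes factor for the One Set group, with a uniform prior.
print(round(posterior_null(10.6), 3))  # 0.914
```

With equal prior probabilities for the two hypotheses, the posterior is simply BF / (1 + BF), which gives 10.6 / 11.6, or approximately 91%.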

Differences across counterbalancing lists
As mentioned in the Materials section, the voicing of the onsets of the exposure words was counterbalanced across participants, as was the identity of the held-out consonant. This resulted in 12 lists in total. As a post-hoc analysis, we explored whether there were differences across those lists. Since there were only six subjects in each combination of list and exposure group, we pooled together the lists based on the voicing and manner of articulation of the held-out consonant; for example, the lists where [p], [t] and [k] were the held-out consonants were collapsed into a single voiceless stop category. Figure 2 plots the results broken down in this way. Differences across the lists appear to be minor, although the high uncertainty (due to the low number of trials) makes it difficult to draw definite conclusions (for example, there appears to be a tendency for the One Set group to distinguish attested from unattested onsets in voiced stop lists).
We repeated the statistical comparison between CONF-ATT and CONF-UNATT words in two ways.
First, we added a manner factor, indicating whether the held-out consonant was a stop or a fricative, as well as the interaction of that factor with condition. The effects of these predictors were not significant in any of the exposure groups. Second, we added a voicing factor and its interaction with condition. The main effect of voicing and the interaction were not significant in the One, Two and Eight Sets groups.
There was a significant interaction in the Four Sets group, such that the effect of condition was larger when the exposure onsets were voiceless (χ²(1) = 6.96, p = .008). Since a result restricted to the Four Sets group does not have a clear interpretation, and this was one of a large number of post-hoc tests, we do not comment on this finding any further. A higher-powered investigation of this difference may be an interesting direction for future research.

Discussion
Participants in Experiment 1 were taught artificial languages that had a categorical natural-class based phonotactic regularity: all word onsets had the same voicing (either all voiced or all voiceless, depending on the list). Participants then judged the acceptability of novel words with onsets of three types: CONF-ATT onsets, which were encountered during exposure; CONF-UNATT onsets, which shared the value for the voicing feature with the onsets of the exposure words but were not encountered during exposure; and NONCONF-UNATT onsets, which had the opposite value for the voicing feature from that of the exposure words.
CONF-UNATT onsets were consistently endorsed more often than NONCONF-UNATT onsets, regardless of the amount of exposure: even after a single set of exposure to each onset type, participants preferred onsets with the same voicing as the exposure onsets to onsets with the opposite voicing. Conversely, participants did not start distinguishing CONF-ATT from CONF-UNATT onsets until after two or more exposure sets. The three-way distinction between CONF-ATT, CONF-UNATT and NONCONF-UNATT words was similar in the Two Set, Four Set and Eight Set groups: despite growing indirect negative evidence suggesting that not all conforming onsets occur in the language, participants continued to generalize beyond the attested onsets.
Figure 2: Endorsement rates for Experiment 1, broken down by the voicing and manner of articulation of the held out (CONF-UNATT) onset. Error bars represent bootstrapped 95% confidence intervals.
Words that started with a NONCONF-UNATT onset were judged to be acceptable at a fairly high rate (around 60% of the time), even after eight exposure sets. This is likely to reflect the fact that onset voicing is far from the only possible dimension of generalization from the exposure words. Just as all exposure words had the same voicing, they also all started with a consonant, had two syllables, were stressed on their first syllable, and so on. We suspect that test words that differed from the exposure words in more dimensions, such as ulpiuzi or eh, would have been endorsed at a lower rate. The results of the current experiment do not allow us to delineate exactly how far participants would be willing to generalize: the only conclusion we can be confident about is that either they did not generalize to NONCONF-UNATT onsets at all, or if they did, they did so to a lesser extent than to CONF-UNATT onsets.
Participants in Experiment 1 generalized rapidly, before they were able to distinguish the sounds they were exposed to from unattested but similar sounds. They continued to generalize even after as many as eight exposure sets. Experiment 2a tests the generality of that result by applying the same experimental paradigm to a language that differs from the language of Experiment 1 in two ways.
First, generalization in Experiment 1 was supported by a categorical regularity: all of the words in the language had the same voicing. There is evidence that speakers' knowledge of the distribution of sounds in their language is not limited to the categorical distinction between possible and impossible sound sequences; rather, speakers keep track of the relative frequencies of the possible sounds and sound sequences (Vitevitch, Luce, Charles-Luce, & Kemmerer, 1997). In English, for example, the nonword riss, which is composed of frequent sound sequences, is judged to be a more likely potential word than youdge (Coleman & Pierrehumbert, 1997; Jusczyk & Luce, 1994).
Second, the generalization in Experiment 1 was stated over a phonetically defined class of sounds.
While many phonotactic generalizations in natural language are based on the phonetic properties of individual sounds, some generalizations are abstract, in that they do not make reference to the phonetic properties of any particular sound. The simplest type of such a generalization is sound identity (repetition). A large range of studies have shown that such generalizations can be acquired in artificial language studies (Gervain, 2014; Marcus et al., 1999; Moreton, 2012). In natural languages, generalizations that have to do with segment repetition or bans on repetition have been documented in Yucatec Mayan, Hebrew, Peruvian Aymara and other languages (Berent, Marcus, Shimron, & Gafos, 2002; Gallagher, 2013).
To replicate the findings of Experiment 1 and broaden the scope of the conclusions that can be drawn from those findings, then, Experiment 2a tested whether the pattern of results held for a probabilistic abstract generalization. All of the words in the language used in this experiment had the form C1V1C2V2. Vowels in the language varied freely, and the consonant pairs followed one of eight consonant-specific phonotactic regularities. Four of those regularities involved two different consonants, e.g., C1 = [k] and C2 = [s]; two words conforming to this particular regularity are kesa and kisu.
While the phonotactics of the language can be captured precisely using these eight consonant-pair specific regularities, it was also the case that half of the words in the language followed the abstract regularity C1 = C2, much more than would be expected by chance. If participants learned this abstract generalization, they should generalize it to words that contain identical consonants outside of those included in the exposure phase. As mentioned above, numerous studies have shown that participants are able to learn repetition patterns (Endress & Bonatti, 2007; Gerken, 2006; Gervain, 2014; Marcus et al., 1999); our goal is to build on those studies to investigate how generalization to new repeated consonants depends on the amount of exposure to the language.
As in Experiment 1, exposure sets were created that included exactly one word that followed each of the narrow regularities, for a total of eight words per exposure set (see Table 2).

Materials and procedure
All words in the experiment were of the form C1V1C2V2, e.g., kesa. The exposure words had one of eight different consonant pairs, four of which were identical and four of which were not (see Table 2).
All participants were presented with 16 test words, eight with the consonant pairs heard in exposure (ATTESTED) and eight with new consonant pairs (UNATTESTED). Each of the individual consonants C1 and C2 in the unattested consonant pairs was encountered during the exposure phase, in both initial and medial position, but not as a combination. In the exposure phase, participants listened to one, two, four or eight exposure sets. All exposure words differed from each other; that is, the same consonant pair was never heard with the same vowels more than once. As in Experiment 1, the specific words from the exposure phase were never repeated in the test phase. For example, if bagu and biga appeared in the exposure phase, neither could appear in the test phase, but bega could.
All participants were exposed to the same C1-C2 pairs, though the particular words (i.e., the combinations of consonant pair and vowel patterns) differed across participants. Items were pseudo-randomized in blocks as in Experiment 1. In particular, the vowel patterns were randomized for each participant separately, such that the consonant pair was the only cue that systematically distinguished the test conditions from each other.

Participants
A total of 280 participants completed the experiment, 70 in each group. Demographic information was not collected due to a technical failure.

Statistical analysis
As in Experiment 1, we fitted a full model that included participants from all four groups, as well as within-group models for each of the groups. The full model had three fixed effects: one between subjects (the exposure group) and two within subjects (Attestation and Conformity). The random effect structure for subjects in the full model included an intercept and random slopes for Attestation, Conformity and the interaction between Attestation and Conformity; we also had a random intercept for the consonant pair. As before, p-values were calculated using the chi-square approximation to the likelihood ratio test.

Full model
Figure 3 illustrates the mean endorsement rates for each group and condition. The full statistical model yielded an effect of group (χ²(3) = 25.6, p < .001), reflecting the fact that endorsement rates were higher on average for participants who received more exposure to the language. There was also an effect of Attestation, reflecting higher average endorsement rates for words with ATT than for words with UNATT consonants (χ²(1) = 11.7, p < .001), and an effect of Conformity, reflecting higher average endorsement rates for CONF than for NONCONF words (χ²(1) = 11.1, p < .001).
The effect of Attestation was modulated by an interaction with group (χ²(3) = 35.6, p < .001), which reflects the fact that participants were better at distinguishing ATT from UNATT items the more exposure they received to the language. The interaction of group and Conformity was not significant (χ²(3) = 1.3, p = .73), and neither was the interaction between Conformity and Attestation (χ²(1) = .04, p = .85). The interpretation of these findings is complicated by the significant three-way interaction (χ²(3) = 8.44, p = .04); Figure 3 suggests that the three-way interaction reflects the fact that as participants received additional exposure sets, the effect of Conformity gradually diminished, but only for test words with ATT consonants; the effect of Conformity was robust for test words with UNATT consonants even in the Eight Sets group.
The Bayes factor in support of the null hypothesis of no Attestation main effect was 33.2. A similar test for the interaction yielded a Bayes factor of 25.7. Both values are characterized as providing "strong" evidence for the null hypothesis (Wagenmakers, 2007).
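For readers who want to check the p-values reported above against the chi-square approximation to the likelihood ratio test, the following is a minimal sketch using only the Python standard library. The function name `chi2_sf` is ours, and the closed-form series is a generic textbook identity for integer degrees of freedom; it is not the analysis code used for the experiments.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum over half-integer powers.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)  # (x/2)^(1/2) / Gamma(3/2)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)  # advance to (x/2)^(k+1/2) / Gamma(k+3/2)
    return total


# Likelihood ratio statistics taken from the text above.
p_three_way = chi2_sf(8.44, 3)   # three-way interaction
p_attestation = chi2_sf(11.7, 1)  # main effect of Attestation
```

The statistic itself is twice the difference in log-likelihood between the model with and without the term of interest; the degrees of freedom equal the number of parameters removed.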

Discussion
After a single exposure to each of the eight possible consonant pairs, four of which were pairs of identical consonants, participants showed a preference for novel words with identical consonants. This preference held regardless of whether or not the particular pair of identical consonants was presented in the exposure phase. Participants did not start showing evidence of having learned individual consonant pairs until they received at least two exposure sets (i.e., two words with each consonant pair).
As in Experiment 1, participants consistently generalized to CONF-UNATT words even after eight exposure sets. To further explore this sustained generalization pattern, we administered the experiment to an additional group of 70 participants, this time with 16 exposure sets. Since we only had 12 distinct words with each consonant pair, some of the exposure words were repeated twice. It was still the case, however, that none of the test words occurred in the exposure phase.
The endorsement rates for the 16 Sets group were similar to the ones for the Eight Sets group, with the exception that the endorsement rate for NONCONF-UNATT words was more similar to the endorsement rate for those words in the One, Two and Four Sets groups (CONF-ATT: 92%; CONF-UNATT: 79%; NONCONF-ATT: 89%; NONCONF-UNATT: 67%); this suggests that the dip in endorsement rates for NONCONF-UNATT in the Eight Sets group visible in Figure 3 was spurious. The two main effects were significant (Attestation: χ²(1) = 29.2, p < .001; Conformity: χ²(1) = 8.8, p = .003), but the interaction was not (χ²(1) = .7, p = .41; all models were fitted without a correlation term between the by-subject intercept and slopes due to model convergence issues). The simple effect of Conformity was significant within UNATT words (χ²(1) = 4.77, p = .03) but not within ATT ones (χ²(1) = .05, p = .83). In sum, statistical evidence for generalization to CONF-UNATT words was robust even for participants who received 16 exposure sets; the fact that this evidence was weaker than in the Eight Sets group may be an artifact of spuriously low endorsement rates for NONCONF-UNATT words in the Eight Sets group.
In conclusion, participants generalized to unattested consonant pairs after very little exposure to the language, and continued to generalize even after being given ample indirect negative evidence suggesting that only certain consonant pairs can appear in the language.

Experiment 2b: Ruling out a pre-existing preference for identity
We interpreted our participants' preference for words with repeated consonants in the One Set group of Experiment 2a as reflecting rapid phonotactic generalization. Before being confident in this interpretation, however, we must rule out the possibility that the higher endorsement rate for test items with identical consonants was due to a prior preference for words with identical consonants rather than due to exposure to the artificial language. Such a prior preference could be derived from the participants' native language or from any number of perceptual or cognitive factors (Endress & Bonatti, 2007; Gervain, …).
Experiment 2b was designed to test for such a pre-existing preference for words with identical consonants. Participants were exposed to eight words, each containing a different non-identical consonant pair.
After the exposure phase, participants judged the unattested items from the test phase of Experiment 2a (both CONF-UNATT and NONCONF-UNATT). An outcome in which participants still showed a preference for identical over non-identical items despite not having seen any identical items in exposure would be consistent with a pre-existing preference for identical items. If, on the other hand, participants showed no identity preference in the test phase, the interpretation of the identity preference in Experiment 2a as being due to learning would stand.

Materials and procedure
All words had the form C1V1C2V2, as in Experiment 2a. As in the One Set group of Experiment 2a, there were eight exposure words and 16 test words. All exposure words had two non-identical consonants (see Table 3). Vowel patterns were chosen at random, with no vowel pattern repeated across the exposure and test words. As in Experiment 2a, half of the test words had consonant pairs encountered in exposure (ATTESTED) and half did not (UNATTESTED). All of the test words in the ATT condition had non-identical consonants encountered in the exposure phase. The unattested words in testing had the same consonant pairs as in Experiment 2a, half identical and half non-identical (four of each). For consistency with Experiment 2a, we still use the labels CONF and NONCONF to refer to the test words with identical and non-identical consonants respectively, even though the exposure phase in Experiment 2b did not provide any evidence for the segment-identity generalization. Since no exposure words had identical consonants, there were no CONF-ATT test items; the three test conditions were NONCONF-ATT, CONF-UNATT and NONCONF-UNATT.
The support that CONF and NONCONF test words received from irrelevant natural-class-based patterns in the exposure set was matched as follows. Each of the eight consonants in the language appeared in the exposure phase once in initial position and once in medial position. As such, the CONF-UNATT and NONCONF-UNATT test words received equal support from the positional frequency of the individual consonants, as in Experiment 2a. In addition, CONF-UNATT and NONCONF-UNATT test words were matched for the amount of natural-class-based support they received from consonant co-occurrences in the exposure words (voicing, place of articulation and manner of articulation).

Statistical analysis
An LMEM was fitted to the results, with a three-level factor of consonant type (NONCONF-ATT, CONF-UNATT, NONCONF-UNATT) as a fixed effect, as well as random intercepts for consonant pairs and subjects and a random slope by subject for consonant type.

Results
The results of Experiment 2b are shown in Figure 4. Contrary to the predictions of the pre-existing preference hypothesis, participants did not show a preference for CONF-UNATT words; if anything, there was a slight preference for NONCONF-UNATT words over CONF-UNATT ones. There was a striking difference between NONCONF-ATT words and both CONF-UNATT and NONCONF-UNATT words: unlike the One Set group of Experiment 2a, participants in Experiment 2b were much more likely to endorse test words with attested than unattested consonant pairs.
Statistical analysis showed that the effect of condition on endorsement rates was highly significant (χ²(2) = 27.6, p < .001). We performed planned comparisons to examine the differences between the levels of the factor. In line with Figure 4, the difference between NONCONF-ATT on the one hand and CONF-UNATT and NONCONF-UNATT on the other hand (i.e., the two UNATT conditions collapsed together) was highly significant (χ²(1) = 27.4, p < .001). By contrast, the difference between CONF-UNATT and NONCONF-UNATT did not approach significance (χ²(1) = .57, p = .45).

Discussion
Participants in Experiment 2b, who were not exposed to identical consonant pairs, did not show any preference for novel items with identical consonants (CONF-UNATT). The results therefore support the learning hypothesis, according to which the preference for identical items after one exposure in Experiment 2a was due to learning during the experiment. Thus, our interpretation of the results of the One Set group in Experiment 2a remains unchanged.
The results of Experiment 2b reveal an additional effect. Unlike in Experiment 2a, participants in Experiment 2b showed a strong preference for attested over unattested consonant pairs after just one exposure. While we cannot make firm claims about the source of this difference, one possibility is that the presence of a broad regularity interferes with the learning of narrower regularities. In Experiment 2a, the presence of the identity regularity may have prevented learners from attending sufficiently to the narrower regularities with small amounts of exposure, while in Experiment 2b learners were free to focus on the specific, attested consonant pairs.
At first blush, the lack of a preference for identical items in Experiment 2b compared to Experiment 2a could still be consistent with a pre-existing preference for identical items: The absence of identical consonant pairs from the exposure data could have been taken as evidence for the generaliza- O/E of 4/1. In other words, the evidence for the overattestation of identical pairs in Experiment 2a is much stronger than the evidence for their underattestation in Experiment 2b. It is therefore implausible to assume that the preference for identical items after one exposure in Experiment 2a was due to preexisting preference, and at the same time that the lack of preference for identical items in Experiment 2b was due to learning during the experiment that offset that preference.
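The observed-to-expected (O/E) figure mentioned above can be reproduced with a small worked sketch. It assumes that each of eight consonants occurs once in initial and once in medial position, and that expected counts come from treating the two positions as independent; the consonant labels and the function name `identity_oe` are placeholders rather than the actual stimuli or the authors' code.

```python
from collections import Counter
from fractions import Fraction


def identity_oe(pairs):
    """Observed and expected counts of identical-consonant pairs.

    Expected counts assume C1 and C2 are combined independently,
    given their positional frequencies in the exposure pairs.
    """
    n = len(pairs)
    first = Counter(c1 for c1, _ in pairs)
    second = Counter(c2 for _, c2 in pairs)
    observed = sum(1 for c1, c2 in pairs if c1 == c2)
    expected = sum(Fraction(first[c] * second[c], n) for c in first)
    return observed, expected


# Hypothetical exposure sets: four identical plus four non-identical pairs
# (as in Experiment 2a), and eight non-identical pairs (as in Experiment 2b).
exp2a_like = [("p", "p"), ("t", "t"), ("k", "k"), ("s", "s"),
              ("b", "d"), ("d", "g"), ("g", "z"), ("z", "b")]
exp2b_like = [("p", "t"), ("t", "k"), ("k", "s"), ("s", "b"),
              ("b", "d"), ("d", "g"), ("g", "z"), ("z", "p")]

oe_2a = identity_oe(exp2a_like)
oe_2b = identity_oe(exp2b_like)
```

With these inputs one identical pair is expected by chance in either design, so four observed identical pairs give an O/E of 4/1, whereas zero observed identical pairs give 0/1.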

Experiment 3: Generalization from a single type
Participants in Experiments 1 and 2a showed evidence of rapid phonotactic generalization. That evidence preceded any evidence that they had learned the narrower, sound-specific phonotactic patterns (i.e., that [k] is an allowed onset). What is the minimal amount of evidence that is required for participants to begin generalizing? In particular, would they generalize based on a single type of onset consonant, or would they wait until they have encountered multiple examples of a phonological class before they begin generalizing to other members of that class, as argued by the minimal generalization hypothesis (Albright & Hayes, 2003; Albright, 2009; Adriaans & Kager, 2010)? Experiment 3 addresses this question by exposing participants to a language in which a particular dimension of generalization is only supported by a single type of sound. If participants still generalized along that dimension, the conclusion would be that learners can generalize based on a single type.
Participants who were taught the Control language were only exposed to the six approximant onsets, which we refer to as APPROX onsets (see Table 4c). This language served to determine whether participants had a pre-existing bias for or against voiceless stop onsets. All onsets were embedded in words of the same form. All participants received a single exposure set.

Method
The approximants [w], [y] and [l] are considered to be voiced consonants that are neither stops nor fricatives (Hayes, 2011). If anything, these onsets should provide support for the voiced fricative test onsets (NONCONF-UNATT) rather than the voiceless stop ones (CONF-ATT). Any preference for CONF-UNATT over NONCONF-UNATT test onsets, then, would be observed despite rather than because of the APPROX onsets.
The test phase of the Two Types language had two CONF-ATT onsets and one CONF-UNATT onset.

Participants
A total of 450 participants were recruited through Amazon Mechanical Turk: 50 participants in each of the three lists for the Single Type and Two Types languages, and 150 participants in the Control language.
Nine participants were rejected because they reported that English was not their only native language.

Statistical analysis
The statistical analysis was similar to the previous experiments, with the exception that our design did not allow us to include an onset type random slope for participants, since we only had a single observation per participant for some of the combinations of onset category and language (e.g., there was only one test token with a CONF-UNATT onset in the Two Types language). As such, the random effect structure in all LMEMs reported below only included random intercepts for subjects and for onsets.

Results
Figure 5 shows the mean endorsement rates for each onset type in each of the languages. The design was not fully crossed due to the absence of CONF-ATT onsets from the test phase of the Control language. Consequently, we performed two separate analyses: one that included all three languages, but only test words with CONF-UNATT and NONCONF-UNATT onsets; and another that included all three onset types, but only the Single Type and Two Types languages.
The significant simple effect in the Single Type language confirms that learners generalized based on a single CONF onset type in exposure. The nonsignificant difference in the opposite direction in the Control language may reflect a tendency to interpret the approximant APPROX onsets in exposure as providing support for voiced over voiceless onsets.

Excluding the Control language
The effect of onset category was significant (χ²(2) = 12.8, p = .002); the effect of language was not significant (χ²(1) = 0.13, p = .72), and neither was the interaction (χ²(2) = 1.22, p = .54). This indicates that the pattern of results is not statistically different across the Single Type and Two Types languages.
Finally, we assessed the statistical significance of the difference between words with CONF-ATT and CONF-UNATT onsets within each language separately. Endorsement rates within the Two Types language did not differ across these conditions (χ²(1) = 0, p = .96); in the Single Type language, the numerical preference for CONF-ATT over CONF-UNATT onsets did not reach significance (χ²(1) = 2.79, p = .09).

Differences across counterbalancing lists
As mentioned above, the voiceless consonant presented in the exposure phase in the Single Type language was [p], [t] or [k], counterbalanced across participants. As a post-hoc analysis, we explore whether the identity of the voiceless consonant in exposure affected participants' generalization patterns. We plot the endorsement rates in the Single Type language broken down by exposure consonant in Figure 6. The most salient pattern is that the difference between CONF-UNATT and CONF-ATT is clearer when [k] is the exposure consonant than when it is [p] or [t].
We next fit a mixed-effects logit model to the results of the Single Type language. The fixed effects were condition (CONF-UNATT, CONF-ATT and NONCONF-UNATT), exposure consonant ([p], [t] or [k]) and their interaction. We additionally had random subject and onset intercepts. There was a significant main effect of condition (χ²(2) = 6.2, p = .04), but the main effect of exposure consonant and the interaction were not significant (exposure consonant: χ²(2) = 1.6, p = .44; interaction: χ²(4) = 7.2, p = .13). We conclude that there is no strong evidence of a difference across the counterbalancing lists.

Figure 6: Mean endorsement rates in the Single Type language of Experiment 3, broken down by the voiceless onset in the exposure phase.

Discussion
What are the limits of rapid phonotactic generalization? The minimal generalization hypothesis (Adriaans & Kager, 2010; Albright, 2009) argues that learners need to be exposed to multiple types exemplifying a phonotactic pattern before they can generalize the pattern to new sounds. Experiment 3 tested this hypothesis by exposing participants to the Single Type language, in which two tokens of a single type of voiceless stop onset (e.g., [p]) were the only basis for generalizing to new voiceless stops. Even after this minimal exposure, participants generalized to unattested voiceless stops, preferring them to other onsets such as [z]; the effect was small but significant.
The Control language was designed to rule out two interpretations of the preference that participants who learned the Single Type language showed for CONF-UNATT over NONCONF-UNATT onsets: first, that participants had a prior preference for voiceless stops, either due to statistical patterns in the English lexicon or for any other reason; and second, that the preference for CONF-UNATT onsets was due to the presence of six APPROX onsets in the exposure phase (though the fact that the two classes of consonants share few phonological features makes this scenario unlikely). After exposure to this language, which included only APPROX onsets, participants did not show a significant difference between the two conditions, and if anything judged test words with CONF-UNATT onsets as slightly less acceptable than ones with NONCONF-UNATT onsets. This suggests that the pattern of endorsement rates for the Single Type language was indeed due to generalization from the single type of voiceless stop in the exposure phase.
In a third language, the Two Types language, the well-formedness of voiceless stop onsets was supported by one token each of two different types of voiceless stops, e.g., both [p] and [k]. Participants again generalized to test words with a CONF-UNATT onset; moreover, they did not distinguish CONF-UNATT from CONF-ATT onsets, replicating the One Set group of Experiments 1 and 2a. There was no significant evidence of a preference for CONF-ATT over CONF-UNATT in the Single Type language either.
The two languages differed in that the Two Types language had a single token of each of the two types of voiceless stop, whereas the Single Type language had two tokens of the same type. While this decision served to equalize the number of exposure words across the languages, two tokens of the same onset appear to be sufficient for some onset-specific learning (compare the Two Sets group of Experiments 1 and 2a), which may explain the (nonsignificant) difference in the Single Type language.
In the Control language, which included only six approximant onsets, voiced fricatives were slightly more likely to be endorsed in the test phase than voiceless stops. This difference, if replicated in more highly powered future experiments, may reflect the fact that voiced fricatives and approximants are both voiced consonants. This difference reverses in the Single Type and Two Types languages and becomes a significant preference for voiceless onsets, even though the exposure phase in those languages included only two voiceless stop onsets, compared to the six approximant onsets. In other words, participants were more willing to generalize across voiceless stops than from approximants to voiced fricatives.
This finding can be interpreted as a preference for generalizing to sounds that differ from the exposure sounds in fewer phonological features, or as a preference for generalizing within natural classes that have fewer members (Albright, 2009): there are three voiceless stops in English, compared to ten voiced consonants.

General discussion
Prior research has shown that speakers can generalize their phonotactic knowledge to novel sounds that share phonological properties with the sounds attested in their language. Similar generalization takes place in artificial language learning experiments: if words in an artificial language often begin with two particular voiceless stops, say [p] and [k], but not with voiced stops, learners will judge novel words that begin with a new voiceless stop (e.g., [t]) as more likely to be words of the language than words that begin with voiced stops. The experiments presented in this paper investigated how generalization to new sounds depends on the amount of exposure to the language that the learner has received.
In Experiments 1 and 2a, participants were divided into four groups that received varying amounts of exposure to an artificial language. In both experiments, participants generalized the phonotactics of the language to words with novel (unattested) consonant patterns, even following brief exposure: they preferred unattested patterns that followed the phonotactic regularities of the language to unattested patterns that did not. By contrast, participants did not start distinguishing the specific sounds they were exposed to from the ones they were not exposed to until they received additional exposure to the language. In other words, participants showed evidence of generalizing (e.g., to new pairs of identical consonants) before they showed evidence of learning any of the specific consonant patterns that supported this generalization (e.g., [p, p]). There was substantial evidence for the "null" hypothesis according to which endorsement rates for attested and unattested consonants were equal following a single exposure set.
In both Experiment 1 and Experiment 2a, the regularity that participants used to generalize to novel sounds was supported by multiple types. In the critical condition of Experiment 3, by contrast, participants were only exposed to a single representative of a phonological class. Even when the amount of exposure to the generalization was severely reduced in this way, participants still generalized to sounds that shared phonological properties with that sound.
The rest of the General Discussion addresses the theoretical implications, limitations and potential extensions of these empirical results. Section 6.1 discusses how the results bear on models of phonotactic learning that are based on phonological classes. Section 6.2 discusses alternative interpretations of our results that do not make reference to phonological classes. Section 6.3 clarifies that our experiments do not allow us to delineate all of the precise generalizations that the participants may have entertained. Finally, Section 6.4 addresses the differences and similarities between our results and the results of previous studies of phonotactic generalization.

Implications for models of probabilistic phonotactics
Three major empirical observations emerge from our experiments. First, participants can generalize from a single type; second, they generalize before they show evidence of distinguishing attested from unattested sounds; and third, they keep generalizing even after a substantial number of exposure sets (up to 16). The interpretation of the first result is straightforward: it is hard to see how the minimal generalization view could be reconciled with it. The implications of the second and third results for computational models are more complicated, and we discuss them in this section. We limit our discussion to models that view phonotactic learning as consisting of the acquisition of a probabilistic model based on phonological features (Adriaans & Kager, 2010;Albright, 2009;Hayes & Wilson, 2008;Linzen & O'Donnell, 2015).

Generalization before sound-specific learning
In the One Set groups of Experiments 1 and 2a, as well as in the Single Type and Two Types languages of Experiment 3, the endorsement rates for the attested sounds and the sounds that participants generalized to were statistically indistinguishable. This is inconsistent with a straightforward implementation of the specific-to-general assumption, in particular in a model like STaGe (Adriaans & Kager, 2010), in which only statistical patterns that are actively used to make phonotactic decisions (word segmentation in the case of STaGe) can give rise to phonotactic generalizations.
Early generalization can be reconciled with the minimal generalization assumption in a model in which learners avoid applying sound-specific patterns to novel words if the number of exposure words that contained that sound was lower than a certain threshold, but can still use those sound-specific patterns to form phonotactic generalizations (Albright & Hayes, 2002). If that is the case, knowledge about multiple specific sounds from a class might lead to generalization to that class without a difference in acceptability between attested and unattested sounds (see Linzen & Gallagher, 2014 for simulations).
While non-minimal generalization models predict early generalization, they do not necessarily predict an outcome in which novel sounds that follow the generalization are judged as equally well formed as the exposure sounds, as they were in the One Set groups. Without additional assumptions, for example, maximum entropy models such as GMECCS predict that attested sounds should always be preferred to unattested ones, regardless of the amount of exposure (for simulations, see Linzen & O'Donnell, 2015). A single exposure to a [b], for example, leads a maximum entropy learner to increase some of the weights that apply to other sounds such as [d] (e.g., the weight for voiced stops or for stops); but it will also increase the weights of classes that apply to [b] but not to [d], such as the weight for labials or a weight specific to [b]. Consequently, the attested sound [b] would be preferred to the unattested [d].
The prediction of both a generalization and an attestation effect made by a "vanilla" maximum entropy model is consistent with the empirical endorsement rates after multiple exposure sets, but is inconsistent with the pattern that emerged after minimal exposure.
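The argument above about a "vanilla" maximum entropy learner can be made concrete with a toy sketch. The four-segment inventory, the four phonological classes, the learning rate and the single unregularized gradient step are all simplifying assumptions chosen for illustration; this is not GMECCS or any other published model.

```python
import math

SEGMENTS = ["b", "d", "p", "m"]  # toy inventory (assumption)
CLASSES = {                      # toy phonological classes (assumption)
    "voiced_stop": {"b", "d"},
    "stop": {"b", "d", "p"},
    "labial": {"b", "p", "m"},
    "segment_b": {"b"},
}


def probabilities(weights):
    """Maximum entropy distribution: p(s) is proportional to exp(sum of weights of classes containing s)."""
    scores = {
        s: math.exp(sum(w for c, w in weights.items() if s in CLASSES[c]))
        for s in SEGMENTS
    }
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}


def gradient_step(weights, observed, rate=1.0):
    """One log-likelihood gradient step: observed minus expected class membership."""
    probs = probabilities(weights)
    updated = {}
    for c, members in CLASSES.items():
        expected = sum(probs[s] for s in members)
        updated[c] = weights[c] + rate * ((observed in members) - expected)
    return updated


# A single exposure to [b], starting from all-zero weights.
weights = gradient_step({c: 0.0 for c in CLASSES}, "b")
probs = probabilities(weights)
```

After this one update, [d] outranks [p] and [m] because it shares the voiced-stop and stop weights with [b], but [b] itself outranks [d] because the labial and [b]-specific weights also increased: a generalization effect together with an attestation effect.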
The absence of an attestation effect after limited exposure may reflect a parsimony bias that encourages the learner to represent the input using fewer phonological classes (Linzen & O'Donnell, 2015; cf. Chomsky & Halle, 1968, p. 337). If the learner has been exposed to five different types of voiced onsets (as in Experiment 1), this bias would lead it to characterize words in the language as beginning with voiced consonants (a single generalization) rather than as beginning with [g], [b], [v], [z] or [D] (five separate generalizations). As the learner receives more exposure to the language, however, the absence of conforming unattested sounds becomes more apparent, and prompts it to revert to a less parsimonious but more accurate sound-specific representation. Similar sparsity pressures can be incorporated into maximum entropy models; Hayes and Wilson (2008), for example, implement a feature selection procedure that starts from simpler phonological classes and only adds more complex ones if there is sufficient evidence for them.
At first blush, it may seem that the early acquisition of broader classes could reflect a bias in favor of more general patterns (e.g., identical consonants) and against sound-specific ones (e.g., [k, s]) … were supported by eight exposure words (Figure 3).

Sustained generalization
The fact that participants kept generalizing at the same rate even after multiple exposure sets is problematic for models that are sensitive to indirect negative evidence. … repeated, one would expect them to occasionally be repeated by chance. The Bayesian model of Linzen and O'Donnell (2015) predicts a sharp decline in generalization by the Eight Sets group, in contrast to participants' behavior; maximum entropy models such as GMECCS (Moreton et al., 2015) suffer from a similar problem.
… Hayes and Wilson (2008) would be sufficient to simulate the results; we were unable to run the code available online on our materials since it requires at least 3000 training items. GMECCS, the other published maximum entropy model (Moreton et al., 2015), does not make clear predictions about the relationship between the amount of exposure data and the generalization being acquired; the authors do report gradual convergence towards the target distribution after multiple steps ("trials") of their learning algorithm, but the relationship between the number of trials and the number of observed data points is unclear (hundreds of such "trials" appear to correspond to a single training example). See Linzen and O'Donnell (2015) for an implementation of a maximum entropy model that is sensitive to the amount of training data in a more straightforward fashion.
The minimal generalization learner (Albright, 2009), on the other hand, does not implement indirect negative evidence: the probability mass reserved for new sounds does not depend on the number of times the attested sounds have been observed. It can therefore capture the sustained generalization pattern. Yet one would expect there to be a limit to speakers' willingness to generalize; English speakers seem to eventually notice the absence of [h]-final words and stop generalizing to those words from words that end with other fricatives such as [f] or [z]. From the empirical perspective, then, it would be useful to determine how robust the sustained generalization pattern is. Would participants continue to generalize even after hundreds of exposure sets? If at some point participants do stop generalizing, that would support probabilistic models that incorporate indirect negative evidence; however, it would still be an important challenge to understand why those models stop generalizing sooner than humans.
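To illustrate what sensitivity to indirect negative evidence amounts to, the sketch below compares two hypotheses with a generic Bayesian size-principle calculation. It is not the model of Linzen and O'Donnell (2015); the two-hypothesis setup, the equal prior, and the counts (four attested identical pairs out of eight possible, four identical-pair words per exposure set, loosely modeled on Experiment 2a) are illustrative assumptions, as is the function name `posterior_broad`.

```python
from fractions import Fraction


def posterior_broad(n_tokens, n_attested=4, n_class=8, prior=Fraction(1, 2)):
    """Posterior probability of the broad hypothesis (any member of the class
    is allowed) after n_tokens observations that all fall within the attested
    subset. The narrow hypothesis allows only the attested members."""
    broad = prior * Fraction(1, n_class) ** n_tokens
    narrow = (1 - prior) * Fraction(1, n_attested) ** n_tokens
    return broad / (broad + narrow)


# Posterior support for the broad class after 1, 2, 4, 8 and 16 exposure sets.
posteriors = {sets: posterior_broad(4 * sets) for sets in (1, 2, 4, 8, 16)}
```

Under these assumptions the broad hypothesis loses support geometrically with every additional token, which is the kind of sharp decline that participants did not show; a learner that reserves a fixed share of probability for unseen sounds would instead stay flat across exposure sets.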

Mechanisms of generalization
The empirical pattern across all experiments was unambiguous: participants showed rapid and sustained generalization to words with novel sounds or sound sequences. A range of proposed psychological mechanisms are compatible with this pattern of results, however. We have focused on an interpretation in which participants judged the test words for acceptability by evaluating whether the test words followed one or more probabilistic generalizations extracted during the exposure phase (Albright & Hayes, 2003;Hayes & Wilson, 2008;Frank & Tenenbaum, 2011;Linzen & O'Donnell, 2015;Moreton et al., 2015).
Yet the same results may be consistent with a view in which participants evaluate the similarity between the consonant pattern of the test word and their memories of the consonant patterns in the exposure words (Goldinger, 1998;Nosofsky, 1986;Redington & Chater, 1996). Such a similarity metric would need to operate over phonological features rather than pure acoustic similarity (Cristia et al., 2013); to account for the results of Experiment 2a, that similarity metric would also need to make reference to the abstract notion of repetition, to prevent [s, s] from being considered more similar to [s, t] than to [t, t]. Once the representational apparatus is equated between the probabilistic abstraction model and the similarity-based exemplar models, however, the two classes of accounts become difficult to distinguish empirically (Barsalou, 1990;Hahn & Chater, 1998); indeed, exemplar models have been interpreted as a process-level implementation of the probabilistic abstraction approach (Shi, Griffiths, Feldman, & Sanborn, 2010). We therefore hesitate to interpret our results as providing support for either mechanistic characterization of generalization.
Did participants generalize using independently represented phonotactic patterns (either rule-based or exemplar-based), or did they use analogy to whole exposure words, matching the test words to their (possibly inaccurate) memories of particular exposure words (Bailey & Hahn, 2001; White, Yee, Blumstein, & Morgan, 2013)? Since our test words were all novel (even CONF-ATT test words differed from all exposure words at least in their vowel patterns), our paradigm does not allow us to probe participants' memory of particular exposure words. We believe, however, that it is unlikely that participants remembered a significant fraction of the exposure words. Words were never repeated more than once in exposure; the high variability of the vowel patterns (and therefore the particular words) is likely to have encouraged learning of the consonant patterns rather than learning of particular words (Gómez, 2002). Indeed, although it is probably more difficult to remember 64 different words (in the Eight Sets group) than eight different words (in the One Set group), participants in the Eight Sets group showed better learning outcomes than those in the One Set group. It is likely that participants were not particularly motivated to memorize individual words: those words were not paired with a meaning, and the instructions emphasized that the test phase would consist entirely of novel words. Finally, the analogy hypothesis is particularly ill-suited to explain the results of Experiment 2a, where participants generalized to CONF-UNATT words that did not share a single sound with the exposure words (e.g., from pipa to keku). These considerations aside, we acknowledge that the role of memory for particular exposure words is an understudied problem in phonotactic learning experiments; future experiments manipulating the factors mentioned above may be able to distinguish lexicon-based generalization from independently represented phonotactic knowledge.

The extent of generalization
All of the experiments reported in this paper followed the same logic: they tested whether participants preferred novel sounds from a phonological class that contained the exposure sounds to novel sounds outside that class. We would like to caution against interpreting the results as indicating that participants extracted a single phonological pattern from the exposure sounds. For instance, while the results of Experiment 1 indicate only that after exposure to [k t f T p s] learners generalized to other voiceless obstruents (the minimal class that included all exposure sounds), they do not provide evidence that participants restricted their generalization only to onsets that belonged to that class. Indeed, it is plausible that participants would also have generalized to classes that only include some of the exposure sounds, such as dorsal stops (a class that includes the exposure onset [k], but also [g] and others) or fricatives (a class that includes [f], as well as [v] and others). The single-class interpretation is even less applicable for the other experiments: in Experiment 2a, participants generalized the consonant repetition pattern to new sounds even though that pattern only held of half of the exposure words. In Experiment 3, participants were only exposed to a single type of voiceless stop (e.g., [k]); they were clearly not in a position to guess the dimension along which they would be expected to generalize in the test phase (voicing, place of articulation, manner...). Rather, it is likely that participants considered multiple probabilistic phonotactic patterns that were compatible with some or all of the exposure items; in the case of [k] in Experiment 3, those patterns may have included voiceless stops, dorsal stops, dorsal consonants and so on.
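To make concrete how many candidate patterns a single exposure sound leaves open, the following sketch enumerates every feature-defined class that contains all of the exposure sounds. The nine-consonant inventory and its three-feature description are simplifying assumptions made for illustration; they are not the feature system assumed in the experiments, and the function name `compatible_classes` is hypothetical.

```python
from itertools import combinations

# Toy inventory (an illustrative assumption, not the paper's feature system).
FEATURES = {
    "p": {"voice": "-", "place": "labial", "manner": "stop"},
    "b": {"voice": "+", "place": "labial", "manner": "stop"},
    "t": {"voice": "-", "place": "coronal", "manner": "stop"},
    "d": {"voice": "+", "place": "coronal", "manner": "stop"},
    "k": {"voice": "-", "place": "dorsal", "manner": "stop"},
    "g": {"voice": "+", "place": "dorsal", "manner": "stop"},
    "s": {"voice": "-", "place": "coronal", "manner": "fricative"},
    "z": {"voice": "+", "place": "coronal", "manner": "fricative"},
    "x": {"voice": "-", "place": "dorsal", "manner": "fricative"},
}


def compatible_classes(exposure):
    """Map each feature specification shared by all exposure sounds
    to the set of inventory sounds that satisfy it."""
    first = FEATURES[exposure[0]]
    # Keep only the feature values on which every exposure sound agrees.
    shared = {
        dim: val
        for dim, val in first.items()
        if all(FEATURES[s][dim] == val for s in exposure)
    }
    classes = {}
    # Every subset of the shared features defines one candidate class;
    # the empty subset corresponds to the whole inventory.
    for r in range(len(shared) + 1):
        for dims in combinations(sorted(shared), r):
            spec = tuple((d, shared[d]) for d in dims)
            members = frozenset(
                s for s, feats in FEATURES.items()
                if all(feats[d] == v for d, v in spec)
            )
            classes[spec] = members
    return classes


# A single exposure type leaves every feature combination open.
for spec, members in compatible_classes(["k"]).items():
    print(spec, sorted(members))
```

With [k] as the only exposure sound, all eight combinations of the three features remain viable in this toy inventory, including voiceless stops ([p t k]), dorsal stops ([k g]) and dorsal consonants ([k g x]). Exposure to [p t k] leaves only four. The sketch covers only classes that contain all of the exposure sounds; classes that contain just some of them would need a further step.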
The consistently high endorsement rates for NONCONF-UNATT test words (items that did not belong to the narrowest phonological class supported by the exposure words, but nevertheless shared many properties with them) can also be taken to suggest that participants generalized to those words as well, though to a lesser extent than to CONF-UNATT test words. In the future, concrete evidence for graded generalization could be obtained by comparing three or more classes of unattested sounds that are increasingly different from the exposure items; for example, if the exposure sounds were voiced stops, the test conditions might be other voiced stops, voiceless stops, non-stop consonants (e.g., fricatives), and finally vowels.
Finally, in a given exposure group of Experiments 1 and 2a, all patterns were represented by the exact same number of exemplars: a participant in the Four Exposures group of Experiment 1, for example, heard exactly four words starting with each of the five onsets. This uniform distribution over attested types may have made the generalization particularly salient, leading to faster and more sustained generalization than would be the case if the distribution were not uniform (for instance, Zipfian); this hypothesis can be tested in future work.

Previous studies of phonotactic generalization
Our finding that participants generalized to new sounds is in line with the results of several other studies (Cristia et al., 2013;Finley & Badecker, 2009;Finley, 2011;Gallagher, 2013). However, those studies tested participants after extensive exposure to the language: 160 words (Cristia et al., 2013), 212 words (Gallagher, 2013) or 120 words (Finley & Badecker, 2009;Finley, 2011). Our study enriches the empirical picture by charting how the generalizations that participants make depend on the amount of exposure to the artificial language, in particular when given a very small amount of exposure: participants in the One Set condition in Experiment 1 received only five exposure words, a fraction of the number of exposure words used in previous studies.
Some learning experiments that used different paradigms from ours have found that participants did not generalize as readily as in our experiments. In a study that assessed phonotactic learning using speech errors in production, participants only generalized phonotactic constraints to new sounds if a period of sleep intervened between the exposure and test sessions (Gaskell et al., 2014). Two studies that examined phonotactic learning in the context of a morphological alternation also did not report generalization to new segments (Peperkamp et al., 2006;Peperkamp & Dupoux, 2007).
Phonotactic learning experiments vary along more subtle methodological dimensions as well. We asked our participants whether they believed that the test words could be part of the language that they had learned. This task is similar to the wordlikeness task used to investigate natural language phonotactics (Coleman & Pierrehumbert, 1997;Bailey & Hahn, 2001) and to the tasks used in learning artificial grammars of word sequences (among many others, Gomez, 1997). Some phonotactic learning experiments have used different tasks. Cristia et al. (2013), for example, asked their participants how frequently they had heard the test items in the exposure phase; even though all test items were novel, participants provided different familiarity judgments to test items of different conditions. It is quite possible that participants generalize more conservatively when judging a test item for familiarity than when judging it for acceptability. In the future, it would be useful to perform a direct comparison across tasks with the same language and training regime.
All of our participants were English-speaking adults. As such, our experiments can be argued to be a closer approximation of second language learning than of first language acquisition. At the same time, we are encouraged by the fact that our findings converge with the results of infant studies. Six-month-old infants exposed to a language very similar to the one used in Experiment 1 behaved similarly to the adult participants in the One Set group of Experiments 1 and 2a: they looked longer at words that started with CONF-UNATT than NONCONF-UNATT onsets, but did not distinguish CONF-UNATT from CONF-ATT onsets (Cristia & Peperkamp, 2012). The infants in that study were exposed to a much larger number of exposure words than the adults in our One Set condition (54 as opposed to five), making it difficult to know how rapidly they generalized. Stronger evidence for rapid phonotactic generalization in infants was obtained in two recent experiments by Gerken and colleagues. Nine-month-olds who had been exposed to a single word with a duplicated syllable (leledi), repeated a few times, preferred novel words with a similar structure, suggesting that they learned a reduplication rule from a single example (Gerken, Dawson, Chatila, & Tenenbaum, 2015); this is consistent with the finding of single-type generalization in Experiment 3. A second study showed that 11-month-olds were able to extract a generalization from only four words (which represented different types), in line with the adults in the One Set condition of Experiment 1 (Gerken & Knight, 2015).

Conclusion
This paper reported on a series of artificial language experiments that investigated the time course of phonotactic generalization. The experiments showed that participants can generalize beyond the specific sounds that occurred in the language following a very short exposure session; in fact, they generalized before they showed evidence of recognizing individual exposure sounds. This was the case regardless of whether the phonotactic regularity that was generalized to new sounds was categorical or probabilistic, and of whether it was based on a phonological class or on an identity relation across segments. Generalization continued undiminished despite growing exposure to the language. Finally, participants were able to generalize to new sounds based on a single type of sound only.
Our results are not fully consistent with any of the existing models of phonotactics: rapid generalization given limited exposure to the language is inconsistent with minimal generalization models (Adriaans & Kager, 2010;Albright, 2009), and the finding of sustained generalization after additional exposure is inconsistent with models that make strong use of indirect negative evidence (Hayes & Wilson, 2008;Linzen & O'Donnell, 2015;Moreton et al., 2015). Our findings can therefore inform the development of more adequate models of phonotactics. More generally, we suggest that models of phonotactics should make explicit predictions concerning the relationship between the amount of training data and the generalizations extracted from it.