Learning vocabulary and grammar from cross-situational statistics

Across multiple situations, child and adult learners are sensitive to co-occurrences between individual words and their referents in the environment, which provide a means by which the ambiguity of word-world mappings may be resolved (Monaghan & Mattock, 2012; Scott & Fisher, 2012; Smith & Yu, 2008; Yu & Smith, 2007). In three studies, we tested whether cross-situational learning is sufficiently powerful to support simultaneous learning of the referents for words from multiple grammatical categories, a more realistic reflection of more complex natural language learning situations. In Experiment 1, adult learners heard sentences comprising nouns, verbs, adjectives, and grammatical markers indicating subject and object roles, and viewed a dynamic scene to which the sentence referred. In Experiments 2 and 3, we further increased the uncertainty of the referents by presenting two scenes alongside each sentence. In all studies, we found that cross-situational statistical learning was sufficiently powerful to facilitate acquisition of both vocabulary and grammar from complex sentence-to-scene correspondences, simulating situations that more closely resemble the challenge facing the language learner.

Monaghan et al. (2015) further demonstrated that adult learners can acquire nouns and verbs simultaneously, with nouns being learned more quickly than verbs. Importantly, they showed that cross-situational learning is robust even under conditions of increased ambiguity. Whereas in previous research (e.g., Yu & Smith, 2007), the appropriate referents and labels always co-occurred in a given trial, this was not the case in Monaghan et al. (2015). Here, participants observed two dynamic scenes while listening to artificial language sentences that described only one of the scenes, thus increasing within-trial ambiguity in both the target reference and referent.
Previous studies have provided words from only one or two grammatical categories, and presented these alongside highly constrained possible sets of referents in the learner's environment. In contrast, in natural language learning, the listener hears utterances comprising multi-word sequences composed of words from multiple grammatical categories, and has to learn to map each of these words to objects and actions, their properties and relations between objects and events in the environment. The question remains whether cross-situational learning is powerful enough as a mechanism to resolve the degree of ambiguity in both the utterance and the scene that faces the language learner. The complexity of natural language raises a further difficulty for the learner. Understanding the syntactic dependencies between words and the grammatical roles of those words is necessary before a word's meaning can be determined. For instance, in the case of the distinction between give and receive, the child needs to understand the roles of agent and patient in the syntax before the meaning of the verb can be properly linked to the event, as the event associated with "John gives the book to Mary" and "Mary receives the book from John" will be identical (Childers et al., 2012; Gleitman, 1990; Gleitman et al., 2005). However, grammar learning in turn appears to require knowledge of the meaning of words that constitute the grammatical categories. For instance, determining that English has subject-verb-object (SVO) word order requires identifying that one category of words (nouns) tends to occur before another category of words (verbs) which has a different set of properties. Abstracting over word-order regularities between grammatical categories would not be possible prior to the learning of such grammatical categories.
This "chicken and egg" puzzle has led to solutions proposing that the semantic features that grammatical categories relate to are innately specified, and that language learning requires acquisition of the links between those words and the innate semantic features (Pinker, 1998). Yet, even with these innate semantic features, learning the links between language and the features still seems to require the simultaneous acquisition of the vocabulary and the grammar: to cluster words according to their semantic features requires knowing that those words possess the semantic features in the first place. To avoid this difficulty of simultaneous acquisition, artificial language studies that investigated grammar learning have often pre-trained participants on the language's (pseudoword) vocabulary prior to exposure to the target grammar (e.g., Amato & MacDonald, 2010; Friederici et al., 2002; Morgan-Short et al., 2014).
In this paper, we test whether cross-situational learning is sufficiently powerful to support mapping from multi-word utterances to complex scenes. We also test whether vocabulary and grammar can both be acquired, at the same time, from cross-situational statistics, or whether training on vocabulary is a prerequisite for successful acquisition of grammatical properties of the language. It is known that cross-situational statistics can provide information both about individual nouns and noun categories (Chen et al., 2018), where participants could simultaneously learn labels for individual objects and for the superordinate category to which the individual object belonged. However, it is not yet known whether mappings between sentences comprising words from multiple grammatical categories and scenes depicting multiple potential referents are also learnable.
In Experiment 1, we exposed adult learners to an artificial language consisting of pseudowords from four different lexical categories, which were arranged in accordance with Japanese syntax. Sentences always featured nouns, verbs and grammatical particles that provided information about subject-object assignment. Adjectives were optional and so did not occur in all sentences. Each sentence was presented along with a complex scene that was described by the sentence. We investigated whether words from within each category could be acquired from cross-situational statistics, without providing feedback. We also tested whether participants had acquired knowledge about the grammar of the language from this exposure. In Experiment 2, we further increased the ambiguity of the information to be learned by presenting two scenes along with each sentence, to determine whether cross-situational statistics could resolve the increased complexity of mapping from a multi-word utterance to an unspecified set of referents in the environment. In Experiment 3, we adjusted the training and testing regime to determine whether testing had an influence on learning vocabulary and grammar from cross-situational statistics. While in Experiments 1 and 2 participants were tested repeatedly to determine acquisition sequences of words and syntax, in Experiment 3 participants were only tested after the exposure phase. The data have been made available online on a third-party archive (osf.io/sbxm4).

Experiment 1: Learning vocabulary and grammar from cross-situational statistics
We employed an artificial language containing nouns, verbs, adjectives, and grammatical markers. The syntax of the artificial language was based on Japanese, with a flexible word order, either SOV or OSV.
In this experiment, we tested whether adult learners can simultaneously acquire the vocabulary and the grammar of this more complex language through passive observation of a scene and listening to a sentence describing that scene. Successful completion of the task required participants to learn the correspondences between nouns, verbs, and adjectives and their referents across scenes without feedback and without any explicit information about the structure of the language. Furthermore, it required participants to learn two marker words that reliably identified subject and object in each sentence, as well as to acquire the word order itself. If participants are able to complete these tasks, it suggests that they can track the individual statistical correspondences between particular words and their objects, events or featural property referents across scenes, despite their uncertainty about both the word-referent mappings and the grammar of the language.
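The idea of tracking word-referent correspondences across scenes can be illustrated with a minimal sketch. It is only a simple co-occurrence counter, not a model proposed or tested in this paper, and the word and referent labels (`wordA`, `alien1`, and so on) are hypothetical placeholders.

```python
from collections import defaultdict

def learn_cooccurrences(trials):
    """Count how often each word is heard while each referent is in view."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, referents in trials:
        for word in words:
            for referent in referents:
                counts[word][referent] += 1
    return counts

def best_referent(counts, word):
    """Return the referent that co-occurred most often with the word."""
    return max(counts[word], key=counts[word].get)

# Hypothetical toy data: each trial pairs the words heard with the referents seen.
# Within a single trial the mapping is ambiguous; across trials it is not.
trials = [
    (["wordA", "wordB"], ["alien1", "jump"]),
    (["wordA", "wordC"], ["alien1", "push"]),
    (["wordD", "wordB"], ["alien2", "jump"]),
]
counts = learn_cooccurrences(trials)
print(best_referent(counts, "wordA"))  # alien1
print(best_referent(counts, "wordB"))  # jump
```

No single trial identifies the referent of `wordA`, but its count with `alien1` exceeds its count with any other referent once the trials are pooled.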

Participants
Twenty university students (Mean age = 22.6 years, SD = 4.1, 15 women) volunteered to participate. All participants were native speakers of English, and none had a background in Japanese or any other verb-final language. Participants were remunerated for their time in accordance with the standard hourly rate of the Department of Psychology at Lancaster University (7 GBP per hour). The study was approved by the ethics review panel of the Faculty of Arts and Social Sciences at Lancaster University and conducted in accordance with the provisions of the World Medical Association Declaration of Helsinki. Sample size was inferred from Monaghan and Mattock's (2012) and Monaghan et al.'s (2015) studies of cross-situational learning of words from single grammatical categories resulting in effect sizes of 0.7, 1.5 and 1.7. If participants are able to learn sentence-to-scene cross-situational correspondences, then with 20 participants, power for finding effects in a similar range would be 0.85 to 1.
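The power figures can be approximated by simulation. The sketch below assumes a two-sided one-sample t-test at the 5% level with normally distributed scores; the text does not state which test the power calculation was based on, so this is an illustration rather than the authors' procedure. The function name and simulation settings are hypothetical.

```python
import math
import random
import statistics

# Two-sided 5% critical value of t with 19 degrees of freedom (valid for n = 20 only).
T_CRIT_DF19 = 2.093

def simulated_power(effect_size, n=20, t_crit=T_CRIT_DF19, n_sims=4000, seed=1):
    """Estimate one-sample t-test power by Monte Carlo simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Standardised scores: true mean equals the effect size, SD equals 1.
        sample = [rng.gauss(effect_size, 1.0) for _ in range(n)]
        t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
        if abs(t) > t_crit:
            hits += 1
    return hits / n_sims

for d in (0.7, 1.5, 1.7):
    print(d, round(simulated_power(d), 2))
```

Under these assumptions the estimate for an effect size of 0.7 falls close to the lower bound quoted above, and the estimates for 1.5 and 1.7 are near ceiling.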

Materials
Eight alien cartoon characters served as referents to nouns in the artificial language (see Supplementary Materials for images). The aliens appeared in either red or blue and were depicted performing one of four actions (hiding, jumping, lifting, pushing) in animated scenes generated by E-Prime 2.0 (Schneider et al., 2002). Fig. 1 shows a sample scene.
The artificial language contained 16 pseudowords, taken from Monaghan and Mattock (2012) (see Supplementary Materials for list of stimuli). Fourteen bisyllabic pseudowords were content words: Eight nouns (one per alien), four verbs (one per action), and two adjectives (one per colour). Two monosyllabic pseudowords served as grammatical markers that reliably indicated if the preceding noun referred to the subject or the object of the sentence. We opted for this distribution so that the artificial language mirrored natural language properties more closely (see also Monaghan & Mattock, 2012). Word-referent mappings were randomly generated for each participant to control for preferences in associating certain sounds to objects, actions, or colours (see Monaghan & Fletcher, 2019, for discussion). The pseudowords were read and recorded individually by a female native speaker of English in a monotone. The artificial language sentences were assembled by E-Prime, with a 250 ms pause between each word.
The grammar of the artificial language was based on Japanese. Sentences could either be SOV or OSV, i.e. the verb phrase (VP) had to be placed in final position but the order of subject and object noun phrases (NP) was free. NPs contained an optional Adjective (A) prenominally, a noun (N), and a post-nominal grammatical marker that indicated if the preceding noun was the subject (SUBJECT) or the object (OBJECT) of the action. Adjectives occurred in half the NPs. Sentence length ranged between five and seven words. We generated 192 sentences which were divided into four training blocks each of 48 sentences. Within each block lexical frequencies, subject or object assignment, and word order were balanced. Table 1 summarizes the grammatical sentence patterns that occurred, with equal frequency, in the experiment. A further 96 test sentences were also generated and were controlled in a similar way to the training sentences.
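The phrase-structure rules described above can be sketched as a small sentence generator. The tokens (`N1`, `V1`, `A1`, `SUBJ`, `OBJ`) are placeholders rather than the actual pseudowords, the two aliens in a sentence are assumed to be distinct, and sampling is purely random, so the sketch does not reproduce the balancing of lexical frequencies and word orders used in the experiment.

```python
import random

NOUNS = [f"N{i}" for i in range(1, 9)]   # eight nouns, one per alien
VERBS = [f"V{i}" for i in range(1, 5)]   # four verbs, one per action
ADJECTIVES = ["A1", "A2"]                # two adjectives, one per colour
SUBJ, OBJ = "SUBJ", "OBJ"                # the two grammatical markers

def noun_phrase(rng, noun, marker, with_adjective):
    """NP -> (Adjective) Noun Marker."""
    words = [rng.choice(ADJECTIVES)] if with_adjective else []
    return words + [noun, marker]

def sentence(rng):
    """S -> NP NP V, with subject and object NPs in either order (SOV or OSV)."""
    subj_noun, obj_noun = rng.sample(NOUNS, 2)
    subj = noun_phrase(rng, subj_noun, SUBJ, rng.random() < 0.5)
    obj = noun_phrase(rng, obj_noun, OBJ, rng.random() < 0.5)
    first, second = (subj, obj) if rng.random() < 0.5 else (obj, subj)
    return first + second + [rng.choice(VERBS)]

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sentence(rng)))
```

Each generated sentence has five words when neither NP carries an adjective and seven when both do, which matches the stated range of sentence lengths.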

Procedure
Participants were trained and tested on the artificial language over four blocks of training, interspersed with four testing blocks. This allowed us to determine the acquisition order of the different linguistic elements of the system (nouns, verbs, adjectives, marker words, word order). The entire procedure took approximately 45 min.
Training blocks. After providing informed consent, participants were instructed that they would learn a new language, spoken by the "friendly inhabitants of a distant planet". They were not asked to pay attention to any particular aspect of the alien language. Their task was to passively observe a dynamic scene on the screen and to listen to an artificial language sentence describing the scene.
In each trial, participants first viewed an animated scene in which two alien characters performed an action. They then heard the sentence describing the scene (e.g., for the scene shown in Fig. 1). This was then followed by another presentation of the action. The next trial then immediately began. No response was required. There were four training blocks (n = 48 unique trials per block).
Prior to the first training block, participants observed two practice trials with aliens performing actions, accompanied by random pseudoword sequences. The aliens, their colours (green), the actions and the pseudowords used in the practice did not further occur in the experiment.
Test blocks. Each training block was followed by a test block. In each of these, vocabulary learning was assessed first. This was done by means of a two-alternative forced-choice task, in which participants were presented with two animated scenes and played a test sentence. Their task was to decide, as quickly and accurately as possible, which scene the sentence referred to. Words from each lexical category were assessed by varying the target and distractor scenes by one piece of information, such that knowledge of the vocabulary relating to the individual piece of information was required to determine which scene was described by the utterance. Thus, to test noun learning, participants saw two scenes that only differed with regard to one alien character. In the verb test trials, only the actions were different between the scenes. In the adjective test trials, the colours of the aliens were switched. Finally, in the grammatical marker test trials, the subject/object assignment was reversed, though note that understanding the grammatical role markers could be considered part of the grammar rather than the vocabulary.
Fig. 1. Participants were presented with coloured aliens performing an action and heard a sentence (e.g., "Haagle chelad tha goorshell sumbark noo fisslin", meaning: red alien 5 jumps over blue alien 7).
Table 1. Grammatical sentence patterns that occurred in the experiment. Column headings: Word order, First, Second, Third.
Rebuschat, et al. Cognition xxx (xxxx) xxxx
There were 24 trials in the vocabulary test. Of these 24 trials, sixteen were used to determine if participants had acquired the words from the different lexical categories (four trials per grammatical category), with a further eight trials used as fillers, in which the distractor scene was randomly assigned and so could vary by several aspects. Trials occurred in randomised order, and no feedback was provided. An example lexical test trial is shown in Fig. 2.
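The construction of target and distractor scenes that differ in a single respect can be sketched as follows. The dictionary representation of a scene and the function name are hypothetical, and for noun trials the sketch always replaces the subject alien, whereas the text only says that one alien differed.

```python
import random

ALIENS = list(range(1, 9))
ACTIONS = ["hiding", "jumping", "lifting", "pushing"]

def make_distractor(target, category, rng):
    """Return a copy of the target scene that differs only in the tested category."""
    foil = dict(target)
    if category == "noun":
        # Simplification: replace the subject alien with one not in the scene.
        used = {target["subj_alien"], target["obj_alien"]}
        foil["subj_alien"] = rng.choice([a for a in ALIENS if a not in used])
    elif category == "verb":
        foil["action"] = rng.choice([a for a in ACTIONS if a != target["action"]])
    elif category == "adjective":
        # Switch the colours of the two aliens (assumes they differ in colour).
        foil["subj_colour"], foil["obj_colour"] = target["obj_colour"], target["subj_colour"]
    elif category == "marker":
        # Reverse subject and object roles: each alien keeps its own colour.
        foil["subj_alien"], foil["obj_alien"] = target["obj_alien"], target["subj_alien"]
        foil["subj_colour"], foil["obj_colour"] = target["obj_colour"], target["subj_colour"]
    else:
        raise ValueError(f"unknown category: {category}")
    return foil

# Scene corresponding to "red alien 5 jumps over blue alien 7".
target = {"subj_alien": 5, "subj_colour": "red",
          "obj_alien": 7, "obj_colour": "blue", "action": "jumping"}
rng = random.Random(0)
for category in ("noun", "verb", "adjective", "marker"):
    print(category, make_distractor(target, category, rng))
```

Because each distractor departs from the target in only one tested property, choosing the correct scene requires knowing the word that encodes that property.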
After completing the lexical test trials, grammar learning was assessed next in each of the test blocks. This was done by means of a grammaticality judgment task. Participants were told that they would see a scene and hear a sentence spoken by another alien from a very different planet who was also learning the new language. Their task was to decide, as quickly and accurately as possible, whether the new alien was speaking correctly. If the sentence sounded "good", participants had to press a green button on a computer keyboard. If it sounded "funny", they had to press a red button. There were 16 syntactic test trials in each block. Half the trials followed the grammar of the artificial language, with SOV and OSV sentence patterns carefully counterbalanced. The other half involved sentences with syntactic violations (*SVO, *OVS, *VSO, *VOS). Trials occurred in randomised order, and no feedback was provided on response accuracy.
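The split between grammatical and ungrammatical word orders follows directly from the verb-final rule, as this short sketch illustrates. The function name is hypothetical.

```python
from itertools import permutations

def is_grammatical(order):
    """The artificial language is verb-final, so only SOV and OSV are legal."""
    return order.endswith("V")

orders = ["".join(p) for p in permutations("SOV")]
grammatical = [o for o in orders if is_grammatical(o)]
violations = [o for o in orders if not is_grammatical(o)]
print(grammatical)  # ['SOV', 'OSV']
print(violations)   # ['SVO', 'OVS', 'VSO', 'VOS']
```

The four remaining permutations are exactly the violation types used in the grammaticality judgment task.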
Of the 160 test items (96 lexical tests, 64 syntactic tests) used in the four test blocks, seven test items had previously occurred in the study materials. One of these was a syntax test item, three were noun test items (distributed across test blocks), one was an adjective test item, and two were marker test items (again in different test blocks). No repetitions occurred in adjacent blocks. Due to the low level of repetition in the study, the lack of proximity between the rare repetitions of sentences, and the substantial number of unique sentences used across training and testing, we believe that memorizing sentences is a highly unlikely strategy to be used by participants.
After the final test block, participants completed a debriefing questionnaire that probed any strategies that participants may have used for the task, and also completed a language background questionnaire.

Results
Accuracy for acquisition of nouns, verbs, adjectives, marker words, and word order for each of the four test blocks is shown in Fig. 3.
Whether performance was greater than chance was determined by uncorrected one-sample t-tests for each test. We report these results with uncorrected p-values, thus they should be considered in conjunction with the linear contrast effects of test block to illustrate for which language features learning improves. The results are shown in Table 2. By test 4, performance was significantly above chance for all lexical categories, except for marker words, which were significantly above chance only at test 3, though this was not a significant effect after controlling for multiple comparisons. For word order, performance was significantly above chance across all the tests.
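A comparison against chance of this kind can be sketched with the standard library alone. The scores below are made-up proportions correct, not data from the study, and the function returns only the t statistic, degrees of freedom, and Cohen's d; obtaining a p-value would additionally require the t distribution, which is not part of the standard library.

```python
import math
import statistics

def one_sample_t(scores, chance=0.5):
    """Return (t, df, Cohen's d) for mean accuracy tested against chance."""
    n = len(scores)
    diff = statistics.mean(scores) - chance
    sd = statistics.stdev(scores)       # sample standard deviation (n - 1)
    d = diff / sd                       # Cohen's d for a one-sample design
    t = d * math.sqrt(n)                # equivalent to diff / (sd / sqrt(n))
    return t, n - 1, d

# Hypothetical proportions correct for five participants (not data from the study).
scores = [0.50, 0.75, 0.75, 1.00, 0.75]
t, df, d = one_sample_t(scores)
print(f"t({df}) = {t:.3f}, d = {d:.2f}")
```

In a one-sample design t and d are linked by t = d × √n, so either statistic can be recovered from the other when the sample size is known.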
In order to determine the rate of learning across the four test blocks, we conducted one-way ANOVAs with test block as within subjects factor on each lexical category and word order separately. For nouns, there was a significant effect of test block, F(3, 57) = 9.24, p < .001, ηp² = 0.33, indicating that learning increased with additional training. Polynomial contrasts confirmed that the best fit to the effect of block was linear, F = 20.24, p < .001, ηp² = 0.52. Quadratic and cubic fits were not significant, p > .09. For verbs, there was no significant effect of test block, F < 1, with performance already high by the first testing block. For verbs there were no significant linear, quadratic, or cubic contrasts, p > .31. A significant learning effect was found for adjectives, F(3, 57) = 3.43, p = .023, ηp² = 0.15, which was significant in the linear contrast, F = 6.77, p = .018, ηp² = 0.26, but not in the higher-order contrasts, p > .35. There was also a significant effect for marker words, F(3, 57) = 3.53, p = .020, ηp² = 0.16, with a significant linear contrast, F = 5.38, p = .032, ηp² = 0.22, demonstrating that learning the meaning of these vocabulary items improved with cross-situational training exposure. The quadratic contrast was not significant, p = .617. The cubic contrast was significant, F = 5.19, p = .034, ηp² = 0.22, but we favour the linear fit because it is a simpler model that accounts for slightly more variance in responses. For word order, more training did not result in significantly improved performance, F < 1, as performance was already accurate in the first block of training. No contrasts were significant, p > .25.
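Partial eta squared can be recovered from an F ratio and its degrees of freedom using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to two of the values reported for nouns; the degrees of freedom for the linear contrast, (1, 19), are an assumption based on the sample of 20 participants, since they are not given in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Convert an F ratio and its degrees of freedom to partial eta squared."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Main effect of test block for nouns: F(3, 57) = 9.24.
print(round(partial_eta_squared(9.24, 3, 57), 2))   # 0.33
# Linear contrast for nouns: F = 20.24, assumed df = (1, 19).
print(round(partial_eta_squared(20.24, 1, 19), 2))  # 0.52
```

Both results agree with the effect sizes reported for nouns.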

Discussion
The results demonstrated that cross-situational statistics are sufficiently powerful to enable adult learners to acquire multiple lexical categories (nouns, verbs, adjectives) as well as word order patterns, without prior knowledge of either and without feedback. For marker words, though the comparisons against chance indicated that learning was not significant at any individual test point, the learning effect revealed in the ANOVA linear contrast showed that performance was improving with training. In terms of acquisition sequence, our results suggest that participants first acquired verbs and word order (above-chance performance on all four tests), then nouns (above chance from test 2), followed by adjectives. The size of the learning effects also follows this pattern: large effects of accuracy above chance for verbs and word order, medium-sized effects of learning for nouns that became large by the end of training, and medium effects of learning for adjectives and markers by the end of the study. The key result is that all these language features (both vocabulary and word order) are learnable as a consequence of information present in cross-situational statistics. However, the precise order of learning of each language feature may be partly a consequence of the particular language structure in the design, and so conclusions about prioritization of particular lexical categories cannot be deduced from the current study. We return to these points in the General Discussion. Given that participants were not instructed to consciously learn the pseudowords or discover the underlying syntactic structure, we assume that most learning was incidental, as a by-product of exposure.¹
In Experiment 1, though the precise mappings between words in the sentence and aspects of the scene were unspecified, the set of referents to which the utterance related was still apparent to the learner. However, precisely which scene, or part of a scene, is being described by an utterance is rarely prespecified in natural language interactions (Cartmill et al., 2013; Clerkin et al., 2017). In the next study, we introduced additional complexity by presenting the utterance with two scenes, only one of which is referred to by the sentence. In Experiment 2, we tested whether adult participants can still learn vocabulary and grammar under these conditions of increased uncertainty of the cross-situational statistical correspondences.
¹ Participants were informed that they were going to learn a novel alien language but not instructed to pay attention to any particular aspects of the language or to engage in explicit hypothesis testing. Since it was very obvious from the task that participants were going to learn something new, it made little sense not to mention the alien language at the beginning. In fact, this is not very different from natural language learning in the real world, where adult learners are clearly aware that they are acquiring a new language, either inside or outside of the classroom. Of course, once participants knew they were learning a new language, they could have tried to deploy conscious strategies to boost learning. However, in this case, cross-situational learning should still take place (the process is automatic). Also, given the complexity of the artificial language, this type of strategy would probably not lead to a significantly better performance as there was too much information to consciously recall over multiple learning trials.

Experiment 2

Participants
Twenty university students (Mean age = 25.2, SD = 4.0, 14 women) participated and received payment for their time. All participants were native speakers of English. None of the participants had a background in Japanese or any other verb-final language, and none took part in Experiment 1.

Materials
The materials were identical to Experiment 1.

Procedure
The procedure was identical to Experiment 1, except that in each trial participants observed two animated scenes (rather than one) while being presented with an artificial language sentence that matched one of the scenes (see Fig. 4). Locations of target and distractor scenes were counterbalanced and randomised. The distractor scene was randomly selected, with actions and aliens in the distractor scene always different from those in the target scene. This was done to ensure that the randomly generated distractor scene was not identical to the target scene. The next trial began after three seconds. As with Experiment 1, debrief and language background questionnaires were given to participants.
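The constraint on distractor scenes in the two-scene training trials can be sketched as follows. The scene representation and function names are hypothetical, and screen position is simply shuffled here, which approximates but does not enforce the counterbalancing of locations described above.

```python
import random

ALIENS = list(range(1, 9))
COLOURS = ["red", "blue"]
ACTIONS = ["hiding", "jumping", "lifting", "pushing"]

def random_distractor(target, rng):
    """Sample a foil scene that shares no alien and no action with the target."""
    free_aliens = [a for a in ALIENS
                   if a not in (target["subj_alien"], target["obj_alien"])]
    subj, obj = rng.sample(free_aliens, 2)
    return {
        "subj_alien": subj, "subj_colour": rng.choice(COLOURS),
        "obj_alien": obj, "obj_colour": rng.choice(COLOURS),
        "action": rng.choice([a for a in ACTIONS if a != target["action"]]),
    }

def training_trial(target, rng):
    """Place the target and the foil on random sides of the screen."""
    scenes = [("target", target), ("foil", random_distractor(target, rng))]
    rng.shuffle(scenes)
    return {"left": scenes[0], "right": scenes[1]}

target = {"subj_alien": 5, "subj_colour": "red",
          "obj_alien": 7, "obj_colour": "blue", "action": "jumping"}
print(training_trial(target, random.Random(0)))
```

Unlike the minimal-pair test trials, the foil here can differ from the target in several respects at once, so the learner is not cued to any single feature of the scene.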

Results and discussion
For each test of lexical category and word order, we conducted one-sample t-tests against chance level. The results are shown in Table 3.
One-way ANOVAs with test block as within subjects factor showed the following. For nouns, there was a marginally significant effect of test block, F(3, 57) = 2.72, p = .053, ηp² = 0.16, which was significant in the linear contrast, F = 6.37, p = .021, ηp² = 0.25, but not significant for higher-order contrasts, p > .57. For verbs, there was no significant effect of training, F < 1, and contrasts were also not significant, p > .28. For adjectives, there was again no significant effect, F(3, 57) = 1.21, p = .316, ηp² = 0.06, and the polynomial contrasts were also not significant, p > .13. There was also no significant effect for marker words, F < 1, polynomial contrasts all p > .11, or for word order, F < 1, polynomial contrasts p > .25. Thus, for nouns, verbs, and word order there was evidence of learning by the end of the training, but performance on adjectives and marker words did not demonstrate a clear improvement in performance with training time. As shown in Table 3 and Fig. 3, overall accuracy for the two-screen version of the task appeared lower than for the one-screen version of the task in Experiment 1, with smaller effects of learning evident for all language features. An ANOVA comparing performance on Experiments 1 and 2, with experiment as between subjects factor, language feature (noun, verb, adjective, marker word, word order), and test (1 to 4) as within subjects factors resulted in a significant effect of experiment, F(1, 37) = 4.54, p = .040, ηp² = 0.11 (η² = 0.12), indicating that the increased uncertainty of Experiment 2, though still demonstrating effective learning of nouns, verbs, and word order, resulted in lower overall performance (M = 0.52, SD = 0.11, compared to Experiment 1: M = 0.66, SD = 0.14).
There was a significant effect of language feature, F(4, 148) = 16.59, p < .001, ηp² = 0.31 (η² = 0.45), with verbs and word order learned similarly (p = 1.0, all p-values Bonferroni corrected) and better than nouns (p ≤ 0.012), which were in turn learned better than marker words (p < .001) but were similar to adjectives (p = .90). Adjectives and marker words were not learned significantly differently (p = 1.0). There was a significant effect of test time, F(3, 111) = 6.52, p < .001, ηp² = 0.15 (η² = 0.18), which was also significant in the linear contrast, F(1, 37) = 13.03, p < .001, ηp² = 0.26 (η² = 0.35). The interaction between experiment and language feature was not significant, F(4, 148) = 1.28, p = .28, ηp² = 0.03 (η² = 0.03). The interaction between language feature and test time was significant, F(12, 444) = 2.35, p = .004, ηp² = 0.06 (η² = 0.07).
This interaction was due to the significant effect of time found for noun learning, F(3, 114) = 10.39, p < .001, ηp² = 0.22 (η² = 0.27), adjective learning, F(3, 114) = 3.61, p = .016, ηp² = 0.09 (η² = 0.09), and marker word learning, F(3, 114) = 3.00, p = .033, ηp² = 0.07 (η² = 0.08), but no significant effect of time for verb learning or syntax learning, both F < 1, both of which were already learned effectively from the first block of testing. The three-way interaction was not significant, F(12, 444) = 1.15, p = .318, ηp² = 0.03 (η² = 0.03). Thus, across the two experiments, nouns, adjectives and marker words improved in accuracy with training, but verbs and word order tended to improve at a lower rate given their higher initial learning levels. These effects held regardless of whether there was uncertainty or not about the scene to which the sentence related.
Fig. 4. Example of a training trial for Experiment 2. One of the scenes matched the sentence, and the other was a foil which could vary over the aliens, their colours, the action, and subject/object roles.
Whereas Experiments 1 and 2 show that adult learners can simultaneously acquire multiple lexical categories and the syntactic structure of a complex artificial language via cross-situational statistics, it is not clear whether the interleaving of training and testing blocks may have influenced performance. As mentioned above, in Experiments 1 and 2, participants were tested repeatedly (after each of the four training blocks). This enabled us to determine acquisition sequences, but the existence of these tests might have affected performance. In the lexical test trials, the two scenes differ by one aspect: either one of the aliens, one of the actions, the colour of one alien, or the subject and object roles of the aliens. These scenes thus create minimal pairs, i.e. items that differ in one regard only.
This focus on particular aspects of the scene may have alerted participants to the language structure, and as a consequence promoted their learning. In natural environments, such minimal pairs are unlikely to occur frequently, and so sensitivity to cross-situational statistics, if it is to support natural language learning, may have to function without these minimal distinctions. In Experiment 3, we test whether both vocabulary and grammar could be learned through cross-situational statistics without exposing participants to these minimal pairs, or any other tests, before the end of training.

Experiment 3: Testing cross-situational statistics without interleaved testing
In this experiment, we repeated the design of Experiment 2, with the exception that the acquisition of vocabulary and grammar was only tested at the end of training. That is, in contrast to Experiments 1 and 2, participants did not complete any test blocks prior to completion of exposure and thus did not encounter any minimal pairs that might have influenced learning (see Fig. 2). If learning from cross-situational statistics is dependent on these minimal pairs in the scenes, then learning of vocabulary or grammar may not be observed for this study. However, if cross-situational statistics are sufficiently powerful for learning without focusing the learner on one aspect of the language or the scene, then learning ought to be observed in this study, as in Experiments 1 and 2.

Participants
Twenty university students participated (Mean age = 21.7 years, SD = 5.0, 12 women), and were paid for their participation. All were native speakers of English. None had a background in Japanese or any other verb-final language, and none participated in Experiments 1 or 2.

Materials
The materials were identical to Experiments 1 and 2.

Procedure
The procedure was identical to Experiment 2, except that there was just one testing block for vocabulary and grammar, which occurred at the end of the training. The testing block was the equivalent in number of trials to a single test block in Experiments 1 and 2.

Results and discussion
We conducted one-sample t-tests and computed Cohen's d for performance against chance level (0.5) for each vocabulary type and word order in Experiment 3. The results were qualitatively similar to those from Experiment 2. By the end of training, there was evidence of learning for nouns, t(19) = 2.580, p < .0125, d = 0.58, verbs, t(19) = 8.966, p < .001, d = 2.01, and word order, t(19) = 11.485, p < .001, d = 2.57, but no clear learning effect for adjectives, t(19) = −0.037, p > .05, d = −0.01, and marker words, t(19) = −0.427, p > .05, d = −0.10. An ANOVA comparing performance at the final testing block for Experiments 2 and 3, with experiment as between subjects factor, and language feature (noun, verb, adjective, marker word, word order) as within subjects factor resulted in no significant effect of experiment, F(1, 37) = 1.98, p = .168, ηp² = 0.05 (η² = 0.05). There was a significant effect of language feature, F(4, 148) = 16.69, p < .001, ηp² = 0.311 (η² = 0.45), due to verbs and word order being learned to a similar degree (p = .805, p-values Bonferroni-corrected) and better than nouns (p = .001), which were in turn learned significantly better than marker words (p = .003) but not significantly better than adjectives (p = .157). Adjectives and marker words were not learned significantly differently from one another (p = .309). The interaction between experiment and language feature was significant, F(4, 148) = 4.00, p = .004, ηp² = 0.10 (η² = 0.11). This interaction was due to testing at the end of training compared to testing interspersed throughout the study enhancing the acquisition of verbs, t

General discussion
Language acquisition poses a chicken-and-egg problem: the learner must determine the meaning of vocabulary items while simultaneously discovering the grammatical role of those words in the utterance. Once the learner understands the intended referent of a word, its grammatical category is also evident, which can then provide evidence about the grammatical structure of the language. Similarly, knowing the grammar of the language can usefully constrain the possible referents for a given word. For example, if the learner knows that a novel word functions as a noun, then possible mappings to an action or a property of an object are avoided. At the very least, vocabulary and grammar are intertwined, and in some cases it may be essential to acquire one before the other (Gleitman, 1990).
Cross-situational statistics have proven to be a powerful information source available to aid language learning. In the current study, we determined whether previous demonstrations of learning words from one grammatical category (e.g., Chen et al., 2018; Scott & Fisher, 2012; Smith & Yu, 2008) could scale up to acquire mappings between multiword sentences composed of several grammatical categories appearing with complex transitive scenes, a closer approximation to the ecology of the task facing language learners. In this paper, we explored whether cross-situational statistics could provide a possible source of information to explain how vocabulary and grammar can be acquired simultaneously from complex multi-word utterances and scenes that contain depictions of many semantic featural properties relating to the target language. Our study shows that adult learners can solve the chicken-and-egg problem in language acquisition by keeping track of cross-situational statistics. However, it does not allow us to determine if syntactic knowledge preceded lexical knowledge, or whether the two types of knowledge developed in parallel. Further adaptation of our experimental paradigm could be used to address this question by more fine-grained analyses of the point at which knowledge develops of each language feature.
In Experiment 1 we showed that both vocabulary and grammar could be acquired via cross-situational learning. Without explicit information about the grammatical categories or the meaning of the individual words, participants were able to acquire information about the word order of the artificial language. As one of our reviewers pointed out, it is not clear what aspects of the word order participants actually acquired. For example, to solve the grammaticality judgment trials, participants could attend to the position of the verbs, not paying attention to the flexible word order of subject and object NPs. That is, learning of word order might be quite limited. Nonetheless, it is clear that participants have begun developing a syntactic representation of the language, even if it is unclear how well-developed the representation is at this early stage of acquisition. Participants were also able to quickly discover the word-referent mappings for nouns and verbs and, in the latter part of the study, they also displayed learning of both adjectives and subject/object markers. Though performance for grammatical role markers was not significantly better than chance at any individual test point, there was a significant improvement in learning throughout the study, as reflected in the significant linear contrast for the effect of block for marker words in Experiment 1, demonstrating that this aspect of the language was gradually acquired throughout the study. Acquisition of grammatical role markers was a particularly impressive feat, as these do not occur in the participants' native language and because there were no concrete referents in the scenes for these words.
The fact that new grammatical terms could be acquired further suggests that learning in these experiments was not limited to the discovery of words belonging to grammatical categories that participants had already acquired in the course of the development of their first language.
Whereas the link between particular words and aspects of the environment was highly ambiguous in Experiment 1, the set of words and the set of possible referents for those words were provided during training. In Experiment 2, we addressed this pre-specification of the set of referents by doubling the level of ambiguity by presenting two scenes in each learning trial, with just one of those scenes relating to the sentence. Even under these conditions of greater uncertainty, participants were still able to reliably acquire the referents for nouns and verbs, and determine the word order in the syntax of the language. Experiment 3 demonstrated that this finding was robust even when there was no testing until the end of the participants' training on the task, so learning was not dependent on attention being brought to bear on minimal distinctions between scenes presented to participants during lexical test trials. However, learning of adjectives and marker words was less stable in Experiments 2 and 3. This absence of evidence for learning may highlight language features related to the relative order of acquisition of different grammatical categories in natural language, where salience in speech, predictability of words and categories in the utterance, and prominence of the referent in the environment may each contribute to learning (Behrens, 2015; Ellis, 2006; MacWhinney, 2012).
Taken together, these studies demonstrate that cross-situational learning is a mechanism sufficiently powerful to acquire words from different grammatical categories and the word order of the language, under conditions of substantially greater ambiguity than was present in previous studies of cross-situational word learning. In these previous studies, participants are typically given a set of words from the same grammatical category (Scott & Fisher, 2012; Yu & Smith, 2007) and a set of referents all of which are referred to by one of the words. Even in more challenging cross-situational word learning situations where not all words refer to referents that are present (e.g., Monaghan et al., 2015; Monaghan & Mattock, 2012) or where words refer to either the basic or superordinate category of an object (Chen et al., 2018), the degree of ambiguity is still highly constrained. Our results with the ambiguity present in Experiment 1, further increased in Experiment 2 where two scenes accompanied a given sentence, demonstrate that learning is still possible even under conditions that resemble more closely children's experience of utterances and of the environments these utterances refer to.
Previous studies training participants on artificial languages have tended to pre-train participants on the vocabulary included in the language (Amato & MacDonald, 2010; Friederici et al., 2002; Morgan-Short et al., 2014). This may be for practical reasons because the focus of these studies was on acquisition of the syntax rather than the vocabulary, but it raises the question as to the extent to which grammar and vocabulary can be acquired simultaneously from the language learner's input (Frost & Monaghan, 2016; Marchman & Bates, 1994), or whether one type of knowledge necessarily precedes the other (Gleitman, 1990; Peña et al., 2002; Pinker, 1998). Whether learning is simultaneous or successive for vocabulary and grammar is not yet answered by the results of our study. Our findings show that both can be learned from information present in cross-situational statistical correspondences, but not whether each property of the language is acquired at the same time. Indeed, there are hints of an order of acquisition in the three experiments. Word order and verbs appear to be acquired earliest, followed by nouns. Adjectives and marker words were acquired only later during training, showing the smallest learning effect. Such a pattern of results suggests that word order, supported by identifying the verb referents, provided the learner access to the language structure, yet this mutuality in acquisition of word order and verbs might instead suggest that the learner bootstraps gradually from information about both word order and vocabulary, with learning proceeding in tandem.
The effective learning of nouns in the experiments was unsurprising, given that the noun advantage in language development is well-documented in the literature (Gentner, 1982; Imai et al., 2008). The verb advantage in our experiments could be an artefact, in part, of the fixed word order of the language (SOV and OSV), which facilitated identification of the verb in final position, a salient position in speech (Freudenthal et al., 2010; Jones & Rowland, 2017). Once the verb was identified, this could then be used to support learning of words from the other grammatical categories. An additional contributor to the verb learning advantage in the current study was that there were also fewer verbs to learn than nouns (four vs. eight). In the case of the adjectives, there were only two words to learn, but this did not transfer into an adjective learning advantage over the other content words, possibly in part because this was an optional lexical category, with only half the training sentences featuring adjectives, and likely also because nouns are easier to learn than adjectives in natural language situations (e.g., Gasser & Smith, 1998; Sandhofer & Smith, 2007). Finally, the reduced learning of marker words is worth considering. There were only two marker words, which reliably indicated the subject and object of the sentence, and these occurred in each training trial, and as such they were the most frequently occurring words in the artificial language. Yet, learners displayed relatively little knowledge of these markers. In part, this difficulty in learning marker words could be due to their lower salience. The marker words were monosyllabic (in contrast to bisyllabic nouns, verbs and adjectives) and they only occurred within the utterances, i.e., in less prominent positions than nouns and verbs that could occur at utterance boundaries.
It is also worth considering that in our artificial language, there was a 250 ms pause between the nouns and the post-nominal markers, which is in contrast to what happens in natural head-final languages like Japanese. It is conceivable that the marker words could have been more readily learned if they functioned like affixes to the nouns (without pause), but future research will need to explore this possibility. The greater difficulty of marker word acquisition also aligns with studies of child language development and second language acquisition (e.g., DeKeyser, 2005; Shi et al., 2006), where the delay and difficulty of learning function words has been well documented. However, differences in the frequency and the variability of words within different grammatical categories in the current artificial language make determining order of acquisition less straightforward.
Nevertheless, our experiments show that, without explicit instruction as to the grammar, and without explicit prior knowledge of vocabulary, both can be learned by determining the co-occurrences between particular features of a scene and individual words in sentences. The apparent co-dependence of learning vocabulary and grammar, the chicken-and-egg problem of language acquisition (Childers et al., 2012; Gleitman, 1990; Gleitman et al., 2005; Marchman & Bates, 1994; Monaghan & Christiansen, 2008), is shown to be resolvable by learners tracking cross-situational statistics. Our experiments focused on adult participants, i.e., learners who had already acquired languages and who were likely to possess metalinguistic knowledge of grammatical categories and syntactic relations between them. However, we believe the insights from our study are likely to be relevant to language acquisition by younger learners, too. Previous research (e.g., Scott & Fisher, 2012; Smith & Yu, 2008) has clearly indicated that cross-situational learning can play a role in child language development. Moreover, prior knowledge or experience in acquiring a language can in some cases hinder learning of a novel system (e.g., Ellis, 2006), so there is no reason to believe that adults would necessarily outperform child learners in this type of experiment. Further studies of cross-situational learning of complex sentence-scene correspondences in infants and children would be necessary to determine the role of this source of information in language acquisition, extending our demonstration of its role in implicit acquisition of multiple language features in adults acquiring a novel language.
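To make the notion of tracking co-occurrences between scene features and individual words concrete, the following toy sketch tallies how often each word appears alongside each scene feature across trials and then picks the most frequent feature for each word. It is an illustration only, not a model of how participants in these experiments learned, and it is not the authors' analysis. The pseudowords (blick, dax, pim, wug), the scene features, and the four trials are all hypothetical and are not items from the artificial language used in the study.

```python
from collections import Counter, defaultdict


def count_cooccurrences(trials):
    """Tally, for every word, how often it co-occurs with each scene feature."""
    counts = defaultdict(Counter)
    for words, features in trials:
        for word in words:
            for feature in features:
                counts[word][feature] += 1
    return counts


def best_referent(counts, word):
    """Return the scene feature that co-occurred most often with the word."""
    return counts[word].most_common(1)[0][0]


# Hypothetical trials: (words heard, features visible in the scene).
# Within any single trial the word-to-feature mapping is ambiguous;
# it only becomes recoverable when evidence is pooled across trials.
trials = [
    (["blick", "pim"], ["rabbit", "jump"]),
    (["dax", "pim"], ["dog", "jump"]),
    (["blick", "wug"], ["rabbit", "push"]),
    (["dax", "wug"], ["dog", "push"]),
]

counts = count_cooccurrences(trials)

if __name__ == "__main__":
    for word in sorted(counts):
        print(word, "->", best_referent(counts, word))
```

Raw co-occurrence counts of this kind are the simplest possible statistic; they would not by themselves account for words without concrete referents, such as the subject and object markers discussed above.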

CRediT authorship contribution statement
P. Rebuschat and P. Monaghan developed the study concept and study design. Testing, data collection and preliminary data analysis were performed by all three authors, with additional testing support. P. Rebuschat and P. Monaghan completed the data analysis and interpretation. P. Monaghan and P. Rebuschat drafted the manuscript, and all authors approved the final version of the manuscript for submission.