Acquiring a language vs. inducing a grammar

Standard computational models of language acquisition treat acquiring a language as a process of inducing a set of string-generating rules from a collection of linguistic data assumed to be generated by these very rules. In this paper I give theoretical and empirical arguments that such a model is radically unlike what a human language learner must do to acquire their native language. Most centrally, I argue that such models presuppose that linguistic data is directly a product of a grammar, ignoring the myriad non-grammatical systems involved in the use of language. The significance of these non-target systems in shaping the linguistic data children are exposed to undermines any simple reverse inference from linguistic data to grammatical competence.


Introduction
For many years, theoretical linguistics, at least in the generativist tradition, and computational linguistics developed along largely independent trajectories. There are a variety of reasons for this, both sociological and theoretical, including the difficulty of integrating the increasingly abstract machinery of theoretical linguistics into computational models and the significant influence of practical applications in computational linguistics.1 This disciplinary distance is often summarized by quoting the Natural Language Processing pioneer Fred Jelinek: ''Every time I fire a linguist, the performance of our speech recognition system goes up''. However, in recent years, various theorists (e.g. Clark and Lappin (2010) and Yang and Piantadosi (2022)) have argued that an integration should be forthcoming. Specifically, these recent discussions have concerned the question of whether a computational learning system could successfully model the human ability to learn a language. If it can, then these two disciplines should be able to work in tandem, with theoretical linguists describing the grammatical systems acquired by human learners, and computational linguists identifying the learning procedures which enable them to do so. In this paper, I shall voice some reasons for skepticism about this proposed marriage. Specifically, I shall argue that the structure of the problems that these computational systems seem able to solve is in several ways quite unlike the structure of the problems that human language learners face, and thus that drawing analogies or inferences from the former type of system to the latter is theoretically risky. Many of the points I will make will be familiar, but I hope to present them in a new way.

1. See Pullum (2009), Pater (2019), and Church and Liberman (2021) for different takes on the history of the interactions, or lack thereof, between these disciplines. It is important to note that as I use the label 'computational linguistics', it is to be understood in a restricted sense, covering only the use of computational models to study the structures and acquisition of (natural) language. The broad field of computational linguistics includes a wider range of interdisciplinary research, especially applied work, including search engine optimization, sentiment analysis, and more. This more restricted usage has become fairly prevalent in recent years, especially since the rise of Large Language Models.

Grammar induction
To the extent that any cognitive developmental processes play an essential role in explaining how a human being becomes a competent user of their native language, I will call these processes 'language acquisition processes'. This, of course, casts a very wide net. This is intentional, as I do not wish to prejudge, at this early stage, the important question of how heterogeneous this collection of processes is. This collection can be divided at least along traditional disciplinary lines: human language learners acquire the syntax, semantics, morphology, and phonology of their native tongues. Much of the debate concerning this topic has centered on the extent to which these processes are genuinely learning processes, i.e. whether the products of development are centrally explained by appeal to the structures of the environments in which development occurs. In describing the learning task I will again remain neutral on this point.
The computational models I will be concerned with aim to capture some core component of these processes, and fall under the umbrella term 'grammar induction'. Grammar induction is the process of identifying a grammatical system on the basis of a data set consisting of linguistic strings. These strings are assumed to be generated by a particular grammar, a set of rules or licensing conditions specifying which strings are legitimate, and the goal of the learner is to identify just which grammar this was.2 The label 'induction' here is particularly apposite: these learners bear a clear resemblance to traditional epistemological discussions of inductive learning, which consists in attempting to infer, on the basis of some finite set of observations, the underlying rule (or 'law') governing these observations. A classical inductive learner gathers data about the colors of various birds in their environment, and infers from this data-set to the generalization that all ravens are black. Likewise, a grammar induction model gathers data about, say, the ordering of certain linguistic types (e.g. that encountered adjectives always precede the nouns they modify) and infers that the grammar generating these data incorporates a rule that adjectives and nouns must stand in this relation. As with classical inductive problems, these linguistic inferences are risky. It is logically possible that a grammar which allowed adjectives either to precede or to succeed their nouns could have produced only examples of the former, and so an inductive learner could be misled by the data. But, as always, we will ignore the mere possibility of inductive error.
To make things more concrete, I will consider the example of the recent, but already influential, paper by Yang and Piantadosi (2022). Yang and Piantadosi describe an experimental grammar induction model, which they claim is able to correctly learn a wide range of formal languages, varying in the degree to which they are plausibly relevant to human linguistic competence, often in response to surprisingly small data sets. Their experimental set-up can be captured by describing its five major components:

1. Target grammar
2. Sample data-set
3. Generative hypothesis space
4. Bayesian hypothesis selection
5. F-score evaluation

The target grammar (1) is a formal language which plays two key roles: firstly, it determines which strings are eligible for the learning system's training data, and secondly, it specifies what success for the learning system ought to consist in. A successful language learning system would end up identifying (or, at least, approximating) the grammar which generates the data it is learning from. The target grammars considered in Yang and Piantadosi's experiment vary widely in their complexity (e.g. from the simple a^n, generating strings with any number of concatenated a's, to more complex languages such as the mirror language xx^R, which generates strings in which the second half is the reverse of the first half) and in their position on the Chomsky hierarchy. As many of these languages are infinite, determining a countably infinite set of permissible strings, a method is needed to select from this infinite set a finite collection which can be used as the training data for the learning system. This is where (2) comes in. A sample data-set was drawn from the total collection of possible grammatical strings, ''typically using geometrically distributed string lengths'' (p. 5), i.e. shorter strings are more likely to be selected than long strings. Different trials involved training on different sample sizes, ranging from one to 100,000 strings.

2. This way of putting things is controversial. I will continue speaking of a 'grammar' in the traditional sense, from generative linguistics, according to which a grammar consists of a set of generative rules which specify which complex structures are legitimate combinations out of a set of basic elements, and the ways that the properties of the former are determined by the properties of the latter. There are a number of competing approaches, such as Model-Theoretic Syntax (Pullum, 2007, 2013), which ignores how complex expressions are generated and instead states the constraints such expressions must meet in order to be legitimate, or Construction Grammar (Goldberg, 2006), which removes appeal to rules or constraints, instead viewing language acquisition as a process of expanding a store of more-or-less complex sign-meaning pairings which the language user can use, through composition and analogy, to construct novel expressions, and more, including hybrid theories featuring aspects of multiple such approaches. While I believe that much of what I say will apply also to these alternative understandings of 'grammar', I am happy to settle for the weaker argument and aim to show that if the general generativist understanding of grammar is the correct one, then grammar induction models are unlikely to be good models for human grammar acquisition.
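To make the first two components concrete, the following is a minimal Python sketch of how such training data might be produced. The function names, the alphabet, and the stopping probability are illustrative assumptions of my own; they are not drawn from Yang and Piantadosi's implementation.

    import random

    def sample_an(p_stop=0.5):
        # Target grammar a^n: strings of one or more concatenated a's.
        s = "a"
        while random.random() > p_stop:
            s += "a"
        return s

    def sample_mirror(alphabet="ab", p_stop=0.5):
        # Mirror language xx^R: the second half reverses the first.
        x = random.choice(alphabet)
        while random.random() > p_stop:
            x += random.choice(alphabet)
        return x + x[::-1]

    # A sample data-set: because each added character must survive a
    # p_stop coin flip, string lengths are geometrically distributed,
    # so shorter strings are far more likely to occur than longer ones.
    training_data = [sample_mirror() for _ in range(1000)]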
One of the key innovations in this paper is (3), the generative hypothesis space. In the search for the most likely grammar responsible for the learning data, the learning system considers a range of hypothesized string-generating programs. These are composed of a small number of simple and plausibly cognitively domain-general operations, such as pair(L,C), which concatenates a string L with a character C. The combination of these basic operations generates the hypothesis space which this learning system considers. As some of these operations are recursive, the hypothesis space is infinitely large. However, as Yang and Piantadosi note (pp. 9-10), all but a small number of hypotheses can be effectively discounted after training. Each hypothesis in this space is a program for generating strings, i.e. a string-grammar. Training then consists in comparing the strings generated by these competing hypotheses and those encountered in the training data. A successful hypothesis will be one which makes the encountered strings likely.
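As a rough illustration of what such program-like hypotheses look like, here is a sketch under the same assumptions as above. pair is the primitive described in the text; the two hypothesis programs are hand-written stand-ins for points in the combinatorial space the model actually searches.

    import random

    def pair(L, C):
        # The primitive from the text: concatenate string L with character C.
        return L + C

    def hypothesis_an():
        # A program generating a^n through recursive use of pair.
        s = "a"
        while random.random() < 0.5:
            s = pair(s, "a")
        return s

    def hypothesis_any_ab():
        # A competing, less constrained program over the same primitives.
        s = ""
        while random.random() < 0.5:
            s = pair(s, random.choice("ab"))
        return s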
This brings us on to (4), Bayesian hypothesis selection. Each hypothesis is assessed on the basis of how likely it would make the encountered training data, in standard Bayesian fashion. Each hypothesis is given a prior probability, determined by the complexity of its program, incorporating a preference for simpler hypotheses (those that can be stated with fewer terms) over more complex hypotheses (those that require longer description lengths). These priors are then updated on the basis of how predictive each hypothesis is of the encountered training data, sampled from the target grammar. The system thus aims for the optimal trade-off between simplicity and predictive accuracy. Modulo an allowance for some degree of noise (p. 3), which could allow a simple hypothesis which predicted most of the training data to be selected over a complex hypothesis which predicted all of the data, the system aims to identify the simplest (i.e. shortest) hypothesis which predicts the training data. Whichever hypothesis is assigned the highest posterior probability after this training is then assumed to be (or to be equivalent to) the grammar which actually generated the training data.
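Schematically, and in log space to avoid numerical underflow, the scoring just described might be rendered as follows. The bit-cost per primitive and the noise floor are stand-in parameters of my own, not values from the paper.

    import math

    def log_prior(program_length):
        # Simplicity prior: each additional primitive in the program costs
        # log 2, so longer hypotheses are exponentially less probable a priori.
        return -program_length * math.log(2)

    def log_likelihood(string_probs, data):
        # string_probs maps each string to the probability the hypothesized
        # grammar assigns it; a small noise floor lets a simple hypothesis
        # survive a handful of unpredicted strings.
        noise = 1e-6
        return sum(math.log(string_probs.get(s, noise)) for s in data)

    def log_posterior(program_length, string_probs, data):
        # The selected grammar is the one maximizing this trade-off between
        # simplicity and predictive accuracy.
        return log_prior(program_length) + log_likelihood(string_probs, data)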
Finally, once the learning system has selected a most likely hypothesis in this way, it can be evaluated as to how successfully it has done so.
(5), the F-score evaluation, is a measure of how closely the most likely strings in the training data (2) and the most likely strings according to the highest-ranked hypothesis (4) align with one another. An F-score is the harmonic mean of precision, a measure of how often the strings generated by the selected hypothesis are also generated by the target grammar, and recall, a measure of how often the strings generated by the target grammar are also generated by the selected hypothesis, within a stipulated range. For all but one target grammar, Yang and Piantadosi compare the 25 most likely strings generated by each grammar. So a hypothesis with a perfect F-score will judge most likely just those 25 strings that the target grammar does, but this score can be lowered by either false positives (sub-optimal precision, where there are strings in the top 25 most likely according to the hypothesized grammar which are not deemed as likely by the target grammar) or false negatives (sub-optimal recall, where there are strings in the top 25 most likely according to the target grammar which are not deemed as likely by the hypothesized grammar). Yang and Piantadosi thus use the F-score as an operationalization of success in the task of learning the target grammar: the higher the F-score, the more confidence we should have that the grammar has been successfully learned.
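One simple way to compute such a score, assuming the comparison is between the two top-25 string sets, is the following; the set-overlap treatment is my simplification of the paper's procedure.

    def f_score(target_top, hypothesis_top):
        # target_top and hypothesis_top are the 25 most likely strings
        # according to the target and hypothesized grammars, as sets.
        overlap = len(target_top & hypothesis_top)
        if overlap == 0:
            return 0.0
        precision = overlap / len(hypothesis_top)  # lowered by false positives
        recall = overlap / len(target_top)         # lowered by false negatives
        return 2 * precision * recall / (precision + recall)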
Yang and Piantadosi's model thus both exemplifies the common core of grammar induction models and gives us a comparison point with which to specify the myriad ways that such models can differ. I will start by identifying some such points of variation, before summarizing what I take to be the essential features of such models.
One major distinction between models, relevant to the goals of this paper, centers on analogues of (1), the target grammar, and (2), the training data, and their relationship. One approach to grammar induction, exemplified by Yang and Piantadosi's work, which I will call 'experimental modeling', stipulates the target grammar, and then uses some method to provide a sample of strings generable by this very model. This style of modeling is widespread, and historically central, in computational linguistics, because it provides an easy and clean way to produce training data, and makes the relationship between the target grammar and the training data maximally clear: the latter is generated by the former. These models are particularly useful for testing the general capabilities of learning systems, as we can see which sorts of grammars they are capable of learning and which sorts they are not. The other approach, which I will call 'corpus modeling', instead gathers the training data from naturalistically produced corpora, e.g. a set of strings trawled from some portion of the internet. Such models have the advantage of being, in some sense, closer to the task of human language acquisition, as the data they are trained on better resembles what the child has access to, and a successful model of this sort has identified a grammar capable of producing something more closely approximating human speech. One major downside of such approaches is that they do not require the modelers to specify the grammar that the learning system is supposed to be learning. Such modeling thus treats behavioral success, i.e. accurately predicting the strings found in the corpus, as a proxy for the cognitive task of identifying the grammar causally responsible for the encountered training data. That such behavioral success is not necessarily a good model for human language acquisition will be one of the core claims of this paper.
The hypothesis space ((3) above) also presents a major source of inter-model variation. As noted, Yang and Piantadosi's model is innovative partially in the way that it delimits the space of hypotheses which the learner traverses. In this model, this space is generated through the combination of a range of computationally simple and general functions. This deviates from the traditional approach in grammar induction, in which all hypotheses under consideration are fully specified in advance, as in e.g. Yang (2002). Of course, within these broad classes, the specific hypotheses under consideration will differ substantially depending on the goals and theoretical persuasion of the researcher. Yang and Piantadosi are interested in demonstrating the power of their system to learn a wide range of grammars, including those known not to describe human language, while other models are often more constrained so as to test the ability of these artificial systems to select among possible human grammars. Alternative approaches to linguistic competence, such as Construction Grammar, can provide alternative sorts of hypotheses to be confirmed or rejected by grammar induction models (see e.g. Dunn (2017)). Grammar induction models can vary also with respect to the ways that probabilistic information is encoded into the hypotheses, i.e. whether the hypothesized grammar itself is viewed as favoring likely over unlikely strings, or whether it merely specifies inclusion or exclusion of a string from the language.
Finally, perhaps the biggest point of variation in grammar induction models centers on the inferential structure of such systems. Yang and Piantadosi's work is representative of the very popular Bayesian approach (see Perfors (2008) for an overview). Within the Bayesian school, there is significant variation in how prior probabilities are assigned, in the extent to which such Bayesian inference is hierarchical or not (i.e. whether over-hypotheses are fixed or subject themselves to updating), and more. But there are also non-Bayesian options, both probabilistic and deterministic. Given the wide range of competing approaches, often determined more by the specific interests or goals of the theorists involved than by conviction about humanly plausible learning strategies, this is not the place to summarize all of the myriad options. I will simply note that many options are available, and turn to what I think unifies all of these models.
Ignoring the important and substantial differences between different models, the process of grammar induction can be captured by the following features, with the bulk of inter-model difference located within feature 3:

1. The input to the model consists of a set of strings.
2. These strings are, or are assumed to be, generated/licensed by some underlying rule or constraint system (i.e. a grammar).
3. The system uses some inferential, typically statistical, process (e.g. Bayesian inference) to identify this underlying grammar.
Despite the differences gestured towards above, I will call any model which exemplifies these three traits a 'Grammar Induction' (GI) model. GI thus defines a general class of learning models, each of which might instantiate an inductive system in different ways, relying on different inferences, encoding different biases, and so on. Systems with these features can have many uses. Practically, they have been essential to progress in Natural Language Processing. And theoretically, reliance on such models has plausibly taught us a lot about language and language acquisition. But these practical successes are insufficient to justify the claim that grammar induction accurately captures how human beings acquire their languages. It is this latter claim that I will be investigating. Grammar induction models are, of course, idealized. They differ in many known ways from the processes humans use in acquiring their native language. Of course, that a model is idealized is not a flaw in and of itself. It is widely assumed that all scientific models are idealized in some ways, and being idealized is no barrier to being useful, or even being accurate, in at least some respects. The pertinent question is: do these idealizations capture some core truths about the target system?

Structures and strings
My central concern with grammar induction models is with feature 2 above: the assumption that the input-data available to the model is generated by the target grammar. The assumption that a grammar generates strings, which can be perceived and used as an inductive base by a learner, is absolutely core to work in computational linguistics and the formal language theory it builds on. Indeed, it is typical to define a grammar as a device for generating strings. Grishman (1986), a classic introductory text, says: ''Formally, a language is a set of sentences, where each sentence is a string of one or more symbols (words) from the vocabulary of the language. A grammar is a finite, formal specification of this set'' (pp. 12-13). This approach treats natural languages in the way familiar from logic and computability theory. It was adopted also in early generative work (e.g. Chomsky (1975)).
Despite this pedigree, it is important to note that this is, at best, a quite strong idealization. A grammar, as understood in generative approaches to linguistics (see footnote 2), describes not the set of public symbols produced or producible by competent speakers, but the laws governing the language-specialized mental faculty underlying such usage. What a grammar, qua psychological faculty, generates or licenses are not strings, but structured psychological representations.
These hierarchical representations function internally to the mind, and are thus not directly available to the language-learning child as data from which to induce a grammar. Significant amounts of processing, by systems which are not specific to language, are required to causally connect a hierarchically structured mental representation to an utterance, a public symbol which can be perceived by others, including language learners. These differences can be depicted in flowcharts tracking causal influences in the processes of speaking and learning, described in what follows. Different models are liable to be differently suitable for capturing the core features of different systems, and so which of these depictions best captures the human language learning process will have repercussions for how we should model and theorize about this process.
We can start with the simplest such situation, in which the target grammar, G1, directly generates the training data. We can call the function mapping the grammar onto the data F1. The learner then performs a 'reverse inference', attempting to get back from the data to the grammar which generated it, according to some function F2. Speaking loosely, we can view F2 as the inverse of F1. Grammar induction is successful when, after training, the grammar the learner settles on matches the target, i.e. when F2(F1(G1)) = G1. What is crucial for our purposes is that models with this sort of structure assume that the grammar is the sole influence on the nature of the elements in the training data. This is precisely the structure of grammar induction in experimental models, wherein the strings from which the system learns are indeed products of the target grammar. But even for corpus models, it is assumed that the system's goal is to try and identify the grammar which generates the strings it encounters, the training data, and so this structure is taken for granted.
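The assumed structure can be stated in a few lines of schematic Python; F1, F2, and the equality test are placeholders for whatever generation procedure, learning algorithm, and criterion of grammar-identity a given model adopts.

    def experimental_grammar_induction(G1, F1, F2):
        data = F1(G1)    # the target grammar is the sole source of the data
        G2 = F2(data)    # the reverse inference, from strings to a grammar
        return G2 == G1  # success: the induced grammar matches the target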
However, as noted, the real language-learning situation is quite unlike this. The grammar of competent speakers plays a role in generating data for learners, but it is far from the only factor, nor is it sufficient on its own. A more realistic learning scenario, with unlabeled arrows indicating causal/explanatory dependence relations, would look quite different. G1 does not, unlike in the previous, and simplified, picture based on experimental grammar induction models, generate linguistic data. Rather, G1 generates structured mental representations. These representations are not public elements of linguistic behavior, but private, internal states, connected to public utterances only via a host of further cognitive systems. As Clark and Lappin (2010) write of their criterion for successful language learning: ''the learner must not merely uncover the phonological and syntactic rules underlying language structure, but must also master whatever other regularities determine which sentences in the corpus are generated, whether these regularities are pragmatic, semantic, or due to the influence of world knowledge. Thus, this criterion requires that the learner acquire not merely the language, but much else besides'' (p. 140). If we retain the assumption that our specifically linguistic capacities are distinct from these other cognitive systems, this gives us strong reason to doubt that a grammar induction model trained on natural language data will identify the grammar of natural language, as opposed to some hybrid system which reflects all of these additional confounds and constraints. Such a system may well be very good at predicting human speech behavior, but without thereby identifying the specific linguistic knowledge which is partially, but only partially, responsible for such behavior.
Short of endorsing the radical view that there is no distinction between knowledge of language and knowledge of the world, knowledge of communicative conventions, goals of communication, etc., a defense of grammar induction models as models of human language acquisition must argue that these learning systems are somehow able to 'factor out' all of these distinct causal sources, and identify the grammar's distinctive contribution to the data set from which they learn.
The options for the proponent of grammar induction, then, will involve providing the child with some means of controlling for these non-target influences on linguistic behavior, so that they can identify the target system, the adult grammar. The most natural ways of doing this are either assuming that these non-target systems are particularly unsystematic, e.g. introducing into the data noise which can be dealt with by smoothing in the learning process, or particularly systematic and predictable, so that some variation in the data can be attributed to these influences in a principled way, leaving only the data indicative of the target grammar to be reflected in the hypothesized grammar. In the next section, I shall elaborate more fully on these non-target processes, and give reason to think their influence on the data will not plausibly be dealt with in these ways. This provides a powerful case against viewing natural language acquisition as well modeled by grammar induction.

Confounds in language acquisition
As the above discussion indicates, what language users actually produce, and thus what language learners can learn from, reflects the workings of a wide range of systems other than the grammar. This of course makes the child's job much more difficult. And it makes clear the ways in which standard grammar induction models are idealized. Showing that a grammar is learnable from a data set generated by that very grammar is a far cry from showing that it can be learned from a data set reflecting the workings of myriad non-target systems. Of course, it is a reasonable strategy to begin working on easier problems, and to build up to more complex tasks by de-idealizing the initial models. But some work must be done even in these early stages to provide reason to believe that such de-idealizing will scale up in the ways required to explain the empirical phenomenon under discussion. In this section, I will gesture to the range of such confounding influences on speech, hopefully showing that this requirement is a very strong one. While I will distinguish several such confounding influences, it is clear that these distinctions are somewhat artificial, and that they interact and overlap with one another in many ways.
Before detailing the confounding influences on the linguistic data from which language learners learn, it is helpful to classify a range of different ways in which a data set can be confounded. Retaining our previously adopted idealizing assumptions, what a competent language user has internalized (''what they know'') is a rule or set of rules governing the legitimacy of a set of expressions. And the job of the learner is to identify this rule or set of rules. We can define a pure data-set for a learner as the set of all and only those expressions which are legitimate by the lights of these rules. The point of this section is to stress the ways in which the data children have access to in learning a language deviates from the pure data-set. The data children learn from is confounded in a variety of ways. Firstly, and most familiarly, a variety of systems act so as to filter the linguistic data, so that what the child learns from is reflective of less than all of the pure data-set. Secondly, an overlapping set of cognitive systems add to the linguistic data, introducing new material. And thirdly, and perhaps most significantly, the members of the pure data-set may be manipulated, so that when the child encounters them they take a different form or have different properties. Prima facie, each such confounding influence on the evidence available to the language-learning child makes the language acquisition process more difficult. The data they learn from will be both incomplete and misleading as regards the pure data-set, and this makes it less likely that they will be able to infer, on the basis of their evidence, the rules that actually govern the target system.
The most familiar way in which the actual data on which learners, human or artificial, are trained differs from the pure data-set involves filtering. It is practically definitional of inductive learning that a learner projects their knowledge from some finite sample of their experience. As human language is widely accepted to be infinite, i.e. to allow for an indefinitely large range of legitimate expressions, language acquisition is necessarily a matter of extrapolating from some finite subset of the pure data-set. A fully explicit model of language acquisition must therefore incorporate some discussion of the filtering process that selects, out of the infinite space of such expressions, the finite subset of possible linguistic expressions that a learner actually uses as their evidential base. Filters can take a variety of forms, and while some filtering is a requirement for any learning process, the ways in which a pure data-set is filtered can have serious implications concerning the necessary features of successful language learning systems. As Clark and Lappin (2010) note (p. 80), Gold's celebrated proof that Language Identification in the Limit is impossible for suprafinite classes of languages given only positive data depends on the possibility of the data the learner is exposed to being selected by ''an adversary'', ''a malevolent teacher''. Such a filter can make language acquisition impossible, but the relevance of this result to actual language acquisition is less clear (see Johnson (2004) for discussion). Such 'worst case' filters are thus less widely adopted in modern linguistics and cognitive science. More plausible filters will aim to select a subset of the pure data-set which a child is likely to encounter. For example, Yang and Piantadosi (2022), discussed earlier, provide their learning models with a sample of the pure data-set in which strings are selected according to a geometric distribution over their lengths. This reflects the fact that while enormously long sentences may be licensed by human grammars, they are not encountered by language-learning children, and so their possibility must be inferred by the learner. Other filters, such as the use of 'motherese', the simplified version of language used in some cultures in addressing children (see Newport et al. (1977) for discussion), likely make language acquisition simpler, but even in such cases it must be shown how the child manages to correctly extrapolate to the more complex cases which have been excluded from the sample.
While filters act to make the training data smaller than the pure data-set, there are other factors involved in language behavior which can make the relationship between these two sets more complex, and often in ways that make the learning process more difficult. The data available to the language learner is not a mere subset of the possible legitimate linguistic expressions, but features a range of illegitimate expressions. As Laurence and Margolis (2001) note, ''Children are constantly within earshot of ungrammatical utterances due to speech errors, false starts, run on sentences, foreign words and phrases, and so on. Among other things, this means that children have to settle on a grammar that actually rejects a good number of the utterances they hear'' (p. 230).
In addition to adding and filtering, the most serious barriers to grammar induction models are the ways that expressions in the pure data-set are manipulated, modified, and recast before being entered into the learner's training data. In contrast to addition, wherein a novel element uncorrelated with any of the elements in the pure data-set is found in the training data, manipulation occurs when there is such a correspondence, but where some property of the correlated expressions differs. That is, when the expression in the pure data-set has some property which the expression in the training data lacks, or vice versa. Such manipulations will consist in the influence of extra-grammatical cognitive systems which play some role in causally relating a linguistic mental representation to a piece of linguistic behavior, an utterance. As I will discuss, the relationships between phonetics and phonology, and between syntax and linear word order, are prime places for manipulation. For the child to acquire the target grammar, and not merely identify some way of predicting and replicating speech behavior, they must have a way of distinguishing between the rules governing the grammar and those governing how the outputs of the grammar are manipulated. As noted, on standard assumptions, there is a mismatch between the expressions in the pure data-set, which will be structured mental representations, and the publicly observable utterances in the training data. This mismatch in kind means that the entire training set must have undergone some form of manipulation, posing a particularly stark difficulty for a language learner. I will turn now to the specific influences on the training data, highlighting the ways they pose the problems just canvassed.

Performance constraints
On the model of the causal influences on speech sketched above, the grammar generates hierarchically structured representations. These representations are mental objects, not public symbols. They play a role, perhaps a particularly central role, in the generation of linguistic data, public symbols which can be perceived by others and used by learners as an evidential base. But, to repeat myself, they do not determine the form of these data, as a variety of other psychological systems influence this data set.
Surely the most widely discussed gap between products of the grammar and linguistic behavior is that produced by what are typically called 'performance constraints'. This rather motley collection comprises a variety of extra-linguistic factors which distort the mapping between the structures generated by the grammar and speech behavior. Many of these are at least partially unsystematic. For example, human speakers frequently lose track of what they are saying, revise their utterances midway through, fumble their words, and so on. Assuming that such mistakes are relatively rare compared to well-formed utterances, especially in child-directed speech (cf. Newport et al. (1977)), pattern-finding systems like those used in grammar induction models should be able to deal with them.12

There are, however, other sorts of influences which are plausibly viewed as performance constraints, but which are seemingly less susceptible to straightforward statistical exclusion by the learner. Some of the 'speech shortcuts' that we regularly use in fluent conversation seem to fit this bill. For example, it is widely noted (e.g. Kempson et al. (2001) and Stainton (2006)) that speakers regularly utter mere sentence fragments, rather than fully-formed sentences or even phrases. If the child, as is typical, is modeled as taking an encounter with an expression as evidence that this expression is grammatically well-formed, this steady supply of subsentential expressions may be liable to mislead. A perhaps even clearer case comes from the processes involved in the transduction of a phonological representation into a speech signal, wherein what is actually produced by the motor system conflates and blends the discrete phonological segments and features into a continuously varying soundwave (see the introduction of Liberman (1996) for a review). This produces a particularly clear disparity between the training data and the representations licensed by the target system, wherein the very distinction between component elements (e.g. morphemes or phonemes) in the underlying representations is erased or blurred in the training data. It is worth keeping in mind just how impressive it is that children are able to acquire the relevant representational systems despite this stark degradation of their linguistic data.

12. There are other kinds of influence on speech typically grouped under 'performance constraints' which work slightly differently. For example, memory limitations which mean that uttered sentences are typically quite short are usually discussed in the same breath as mid-sentence revisions and slips of the tongue. To incorporate such limitations into the picture sketched above would require some revision, as these constraints presumably act not on the mapping between representations generated by the grammar and speech, but directly on which linguistic representations are generated.

Externalization
Perhaps the most revolutionary aspect of the generative tradition, at least as compared to the various information-theoretic approaches to language prevalent prior to publications like Chomsky (1957/2002, 1965) and Miller and Chomsky (1963), was the heavy appeal to hidden structure. Generative approaches were so fecund precisely because they allowed for the positing of features or components of linguistic expressions (qua mental structures or representations) which were not straightforwardly reflected in the observed properties of utterances. Like any science, linguistic theory posits unobservables in order to explain aspects of the observable data. Phrasal structure, movement, empty categories, the decomposition of words and morphemes into features, and so on are all only highly indirectly evidenced in any aspect of linguistic behavior. These properties of, and operations and constraints on, linguistic representations feed into a range of cognitive processes, including motor control of the articulators. It is a very strong empirical assumption that enough information is retained through such processes that a learner would be able to identify these causal sources from these behavioral outputs. Grammar induction models, which view the learner precisely as performing such a reverse inference, are committed to exactly this assumption. In this section, I will detail just some of the ways that the underlying representations have been claimed to differ from the resulting speech behavior, thus making the inference from the latter to the former particularly difficult.
An interesting, if speculative, proposal which has been developed in some recent generative linguistic work is that the compositional rules of grammar are entirely dissociated from the linear ordering of expressions in speech behavior (see e.g. Burton-Roberts and Poole (2006) and Chomsky et al. (2019, 2023)). On the simplest view of this sort, syntax consists of nothing more than the operation Merge, which combines two linguistic items, either basic expressions drawn from the lexicon or complex expressions generated by prior operations of Merge, into one larger item. The application of such an operation determines that its two constituent expressions form a larger constituent, but does not assign an ordering to them such that one is 'prior to' the other, analogous to the way that in mathematics set-formation allows for the construction of complex entities without imposing an ordering on set members. The structures generated by Merge thus contain information about the hierarchy of constituency relations (e.g. if L1, L2, and L3 are linguistic expressions, and Merge combines L1 and L2, with the resulting complex subsequently Merged with L3, the information that L3 is a constituent of this new expression, while L1 and L2 are merely constituents of a constituent of it, is retained) without ordering the constituents (i.e. the expression just described does not tell us that L3 comes ''before'' L1 or L2).
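The order-free character of Merge can be captured with unordered sets, as in the following sketch; the use of Python's frozenset is my illustrative choice, not a claim about the proposals just cited.

    def merge(x, y):
        # Merge two items into an unordered constituent: no fact about
        # which of x and y comes 'first' is represented.
        return frozenset({x, y})

    L12 = merge("L1", "L2")   # {L1, L2}
    L123 = merge(L12, "L3")   # L3 is a constituent of L123; L1 and L2
                              # are only constituents of a constituent of it
    assert merge("L1", "L2") == merge("L2", "L1")  # no linear order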

Due to the nature of the articulatory systems, if we are to use such linguistic expressions in acts of communicative behavior, this lack of commitment to linear ordering is an impediment. Speech, for the most part, requires that components of the communicated signal occur one after the other, and so there must be some cognitive system which imposes a linear ordering onto these hierarchical structures, which can then feed into motor control of the articulatory systems. This proposal thus points to a broad cognitive division between the generation of linguistic structure, in accordance with the rules/operations of the grammar, and the externalization of this structure, the processes which map grammatical representations onto something producible by the articulatory system.
The problem that such hypotheses about the cognitive architecture of grammar and speech pose for the approaches to language acquisition under discussion is this: what reason do we have to think that a language learner who is simply identifying the best account of the linguistic data they are confronted with will apportion the influences on linguistic behavior to these two distinct systems (structure-generation and externalization) in the same way that our best linguistic theories will? In particular, there may be theoretical and empirical reasons to view some linguistic phenomenon as reflective of externalization processes, rather than of the structural features of the grammar itself, but such reasons might be entirely absent from the linguistic data available to the child or machine. In such a case, there will be a conflict between grammar induction models and linguistic theory.
To make the point more concrete, consider one of the core phenomena in linguistic typology: the surface word order of a transitive declarative sentence. A transitive declarative is composed of a subject (S), an object (O), and a verb (V). Most languages display a relatively fixed ordering in which these constituents occur, but this ordering differs from language to language (e.g. SVO in English, VSO in Irish). The above discussion points to the fact that there are two distinct ways of accounting for this cross-linguistic difference. One option is to view this surface difference as reflective of a difference in the underlying grammatical structures made available by the different languages. On this account, we infer from the difference in linear ordering that the hierarchical structures involved themselves differ. Another possibility views the underlying structures as identical, instead locating the cross-linguistic differences in the cognitive processes involved in externalizing these structures. As these approaches differ only in localizing where in the mind cross-linguistic variation is realized, both are at least empirical options, and both have been profitably pursued in generative theorizing. Investigating such hypotheses will involve, inter alia, seeing which proposals provide the deepest explanations not only of the linguistic phenomena in particular languages, but also of the regularities observed across languages. A correct understanding of language acquisition must determine how the child manages to determine, as it were, whether the linguistic phenomena it encounters are reflective of the structure-building operations or of the processes by which structures are externalized. Grammar induction models, by viewing children as aiming to identify the grammar which best explains the data they are exposed to, seem to bias this question in favor of the grammatical difference. It is likely that the grammar that best explains Irish data will be quite unlike the grammar which best explains English data. But this ignores the ways that Irish data can be relevant to grammars for English, and vice versa. That is, from the perspective solely of the data available to the child, the grammars posited by linguists might be quite unlikely, positing surprisingly abstract and indirect relationships between the grammar and the observed sentences. But there may be good reasons, stemming from evidence unavailable to the child, such as data from other languages or unattested sentences in the target language, to apportion the explanatory power between the grammar and the externalization processes in just this way.
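The externalization option can be sketched as follows. The flat role-to-word dictionary is a drastic simplification of a hierarchical structure, and the example words and language labels merely flag the analogy; nothing here is drawn from a specific analysis of English or Irish.

    def externalize(structure, order):
        # One and the same structure is mapped onto different linear
        # orders by different externalization systems.
        return " ".join(structure[role] for role in order)

    clause = {"S": "Aoife", "V": "reads", "O": "poetry"}
    print(externalize(clause, "SVO"))  # English-like: Aoife reads poetry
    print(externalize(clause, "VSO"))  # Irish-like: reads Aoife poetry

On this picture, a learner confronted only with linear strings has no direct evidence about whether cross-linguistic variation lives in the structure or in the externalization step.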
While the Minimalist thesis that the grammar is limited to hierarchical structure, and that the linear order of utterances is determined by post-syntactic externalization processes, is an extreme view, the problems it identifies for grammar induction models generalize to a much wider range of theoretical approaches to language. The core worry is that the grammar and the linguistic behavior may display substantially different properties, so that one is unlikely to infer from the latter to the former. This problem will thus be raised by any grammatical theory which deals in things like unvoiced elements (PRO, pro, copies/traces, etc.), movement operations, and so on, as all of these posits involve attributing things to the grammar that are not in any obvious way instanced in the speech behavior. Externalization, for many plausible grammatical theories, thus exemplifies the drastic ways in which linguistic representations are, and must be, manipulated in the process of speaking, thus posing deep worries for any account of the child as performing a reverse inference.

Communicative goals
While it is often noted in more abstract discussions of linguistic theory, the fact that human linguistic behavior is free, rational action rarely influences the actual empirical practice of linguistics. This is significant, because the freedom with which we select linguistic behavior appears to transcend the constraints placed by linguistic competence. That is, our communicative goals can lead us to add to the training data examples which would mislead a learner seeking to identify our grammatical system. There are, of course, a wide range of situations in which we deliberately and knowingly utter something we take to be ungrammatical. This can be for aesthetic reasons, as in poetry (e.g. much of ee cummings' work) and humorous word-play, or for practical reasons (e.g. when linguists produce ungrammatical example sentences). We may recognize that what we are saying is not licensed by our own language, but judge that the correct formulation would be too unwieldy. And so on.
Similarly, the fact that we speak in order to achieve a wide range of communicative goals will drastically constrain, i.e. filter, the set of utterances we make, and thus the data set from which learners will extrapolate. The sentences we utter will thus not be a random sample of those we could utter, but will be biased in various ways, towards the short and grammatically uncomplicated end of the distribution. Similarly, the semantic values of expressed sentences will tend towards the true and the mundane, related to the communicative pressures of daily life. And, while the extent to which linguistic utterances tend to be unique is often under-appreciated, it remains true that there is a range of 'stock phrases' which will tend to be over-represented in our speech habits. All of these sorts of factors suggest that a device capable of generating the set of utterances typically encountered may well be quite different from one with the generative capacities hypothesized by generative grammarians.

Knowledge and conventions
Related kinds of cases come from conventions of language which may be independent of the internalized grammar. What we 'know' about language and language use seems likely to be a very heterogeneous class, including, alongside traditional 'knowledge of language' (i.e. an internalized grammar), familiarity with a range of facts about patterns relating to communicative practices. This latter class may be conscious or not, but will influence the data we produce for language learners without necessarily reflecting the grammatical system. Adult English speakers know that salt and pepper is natural, while pepper and salt is distinctly marked, but this is not a feature of grammatical competence. It is ''merely conventional''. Our ability to pick up extra-, or perhaps better para-, grammatical knowledge of linguistic conventions seems necessary for accounting for speakers' uses of a variety of constructions which seem unexpected from the perspective of grammatical theory. The child must, somehow, learn that salt and pepper is the conventional/typical way of speaking, but without inferring that pepper and salt is ungrammatical. GI models, which lump together all the training data in looking for a compact way of generating these data, seem to face difficulties in striking this balance.
For another example, while in English question-formation auxiliary verbs, but not lexical verbs, are suitable for raising above the sentential subject (Could Andreas run? but not *Runs Andreas?), there are several constructions which seem to allow for the precluded possibility here, such as I kid/shit you not. Analogously, reflexive pronouns like 'myself' require local antecedents (I taught myself but not *I thought you knew this about myself), but these words can also surface in some attempts to use a higher register or sound more formal (For advice, please contact the Teaching Assistant or myself). Further, while linguistic theorists have rightly stressed that certain proposed prescriptive rules of grammar are not the place to look for theoretical insight into the nature of language, it is likely that to a certain extent many of these have become part of the communicative routines of particular speakers, who will go out of their way to avoid ending sentences with prepositions, splitting infinitives, and so on.
There is perhaps a sense of 'linguistic competence' in which a child learning a language must, to be competent, master these somewhat idiomatic constructions. But admitting this is not the same as admitting that such constructions must be licensed by the grammar itself. And there may be strong theoretical reasons to enforce this distinction. A grammar for English that allowed for non-locally bound reflexives would be less elegant, and less explanatory, than one that did not, so it may be better to retain the strong generalization (i.e. Principle A of binding theory) as a claim about the grammar, while allowing that the grammar is but one influence among many on speech behavior, with these exceptions explained by other, non-grammatical, sources. But it is precisely the ability to draw such distinctions between different influences on speech behavior that seems to be missing from models of grammar induction, which look to incorporate all information needed to reproduce, and generalize from, the linguistic data into a unified rule system.
All of these factors, and more, pose difficulties for the idea that the child's goal in acquiring a language is identifying a system capable of generating the data they encounter in their linguistic environment. On the one hand, the grammar serves as just one component in the generation of linguistic behavior, making the latter in a variety of ways unreflective of the former. This in turn makes the alleged 'reverse inference' seem particularly difficult. On the other hand, identification of a system which generates the encountered data is likely to require the incorporation of the contributions of all of these extra-grammatical influences. Thus, a learner aiming to identify such a system is likely to arrive at something quite unlike the grammars posited by linguists.

Paradox revised
I think the nature of the difficulty I am raising for the idea of language acquisition as grammar induction can be clarified somewhat by (re-)considering some traditional discussions analogizing the process of language acquisition in children to language description by linguists. The idea that language acquisition and linguistic theory construction are in some deep sense processes of the same sort is an old one, endorsed in their quite different ways by Chomsky (1965) and Quine (1960). And many paradigmatic approaches to grammar induction seem to fit nicely within the scope of this analogy. Specifically, viewing children as engaged in Bayesian reasoning to identify the grammar most likely to account for their linguistic data makes their project look a lot like the image of science described under the banner of 'Bayesian Confirmation Theory' (e.g. Schupbach (2022), Sprenger and Hartmann (2019)).
We can interpret the 'paradox of language acquisition' (Jackendoff, 1994) along these sorts of lines. Jackendoff asks: if what children are doing is just like what linguists are doing, i.e. learning the rules governing a particular language, then why are children so much better at it? Barring serious pathology or inhumane conditions, all human children, in a relatively short time, manage to master their local languages. On the other hand, an international cohort of thousands of highly intelligent adult linguists, working for decades if not centuries, with the help of massive amounts of data and sophisticated methodologies, is yet to fully specify the complete set of underlying rules responsible for even one human language. Jackendoff, of course, concludes from this that children must have a sizable head start in the process, with their innate (but, crucially, consciously inaccessible) knowledge of language constraining the hypothesis space in ways that make identification of the linguistic rules much easier.
I wish to, in a certain sense, invert these styles of argument, and ask what we should expect to see if children and linguists are indeed operating in roughly the same way, i.e. in identifying the best (e.g. simplest, most elegant, most accurate, etc.) grammar for capturing the data available. In asking this question, it will be crucial to note the significant dissimilarities between the tasks facing these two groups, specifically the differences between the data being used as an inductive base. While there have long been disputes about the qualities of children's linguistic data, concerning how much of it there is, how degraded it is, what sorts of linguistic facts it makes available, etc., it should be beyond dispute that it is qualitatively quite unlike the data used by linguists.
For one thing, while it may be possible for children to get some evidence that certain expressions are ungrammatical in their local language, this is widely assumed to be indirect negative evidence. That is, the child can conceivably (see e.g. Perfors et al. (2010)) learn from the fact that certain kinds of expression are unattested that they are likely to be ungrammatical (although see e.g. Marcus (1993) and Yang (2015) for concerns with such approaches). But direct negative evidence, explicitly being informed that some strings are unacceptable, seems to be scant in the extreme, and when it is offered, children seem unwilling to use it. Linguists, on the other hand, of course make massive use of such data, explicitly ensuring that their theories do not overgenerate, predicting that some sentences are acceptable when they are taken by native speakers to be unacceptable. And the negative data they use will typically be of a kind very unlikely to show up in a child's linguistic data. While parents might correct their child's use of 'breaked' instead of 'broke', children are unlikely to get much even indirect evidence contrasting What did Asif tell Roisin that Rogelia broke? from *What did Asif believe the rumour that Rogelia broke?.
Further, linguists will often appeal to particularly rare utterances in defending one theoretical position over another. Parasitic gaps are a paradigmatic example here. Data of the following sort (from the seminal discussion in Engdahl (1983)) exemplify the phenomenon:

1. This is the paper that I read __ before filing __.
2. *This is the paper that I read the article before filing __.

In 1, we see that the NP 'the paper' is able to serve as antecedent for two separate 'gaps', i.e. 'the paper' is the object argument of both 'read' and 'file'. However, the interpretive relation between the moved NP and one of these gaps is itself dependent on this relation holding between the NP and the other gap. In 2, we see that extraction from the temporal adverbial (of the argument of 'file') is typically impossible. Extraction out of this second gap is possible only when the same argument is extracted out of the first gap (i.e. when this is the argument of the main verb 'read'). The former gap is thus said to be ''parasitic'' on the latter.
What matters for our purposes is that expressions of this sort appear to be vanishingly rare in naturalistic contexts. In a 2004 Language Log post, Chris Potts presents his ''full collection'' of examples found ''in the wild'': just ten, with sources including David Foster Wallace and Fidelity Investments, from which a language-learning child is unlikely to gain much instruction. Pearl and Sprouse (2013) claim to find zero examples of these constructions in a corpus analysis of child-directed speech consisting of 675,000 words. As these constructions have played a sizable role in shaping debates about the nature of constraints on movement in syntactic theory, this again points to a radical difference between the data used by children and those used by linguists in guiding their inductive/abductive strategies for identifying grammars.
Finally, children, unlike linguists, are limited, almost by definition, to relying on data in their native languages. An English-learning child will search for grammars consistent with the English data they encounter. If they encounter data from, say, Turkish, they will either simply ignore it (if it is in insufficient quantities) or they will learn Turkish as a bilingual speaker (if it is in sufficient quantities). Linguists, on the other hand, regularly use cross-linguistic data in defending and developing a theory. Examples of this abound, but notable cases include Kayne's (1981) influential analysis of preposition stranding, comparing French and English, the development of Abstract Case (e.g. Vergnaud (2006), Hui (1990)), and the identification of wh-movement in wh-in-situ languages (Huang, 1982b).

If the child is indeed attempting to induce a grammar on the basis of the linguistic data they encounter, this raises the question: why would we think they would identify the same grammar as the linguist does, given that they are working from such a radically different evidential base? Prima facie, we would not expect a system aimed at capturing the relatively narrow set of single-language sentences to which a child is exposed to end up in the same place as linguistic theory, which is shaped by a broader set of sentences, drawn from a wide variety of languages, and including explicit information about which expressions ought not to be generated. In a sense, the proponent of grammar induction models seems to be betting that these latter sources of evidence (as well as the use of things like neuro-, psycho-, and developmental-linguistic experimentation) are irrelevant to the shape that a grammar will take. That is, that a linguist should be able to identify the grammar purely on the basis of the primary linguistic data. Perhaps these further sources of data could serve a heuristic role, but they cannot be essential to determining the shape the grammar takes; if they were, the child would not be able to induce their own grammar, on account of lacking evidence of this crucial sort. But this, of course, flies in the face of the developed methodology and results of theoretical linguistics.

Covert movement: A case study
The general point can be made most clearly with a discussion of a case study in which the methodology of generative linguistics leads to a conclusion about the nature of human language which would be unlikely to be found by standard grammar induction models. I will focus on the case of covert movement.
Since its inception, one of the major focuses of generative grammar has been the ways that apparently different sentences of a language can be related to one another via processes which manipulate their underlying grammatical structures. Early transformational theories, for example, centered on phenomena like the relations between simple active declarative sentences and their passive, negative, and interrogative counterparts. While this early work posited transformations manipulating entire sentences or trees, such approaches quickly gave way to more constrained operations targeting specific components of grammatical structures and re-locating these components within them. In the 1980s, it was argued that all such manipulations could be subsumed under the single operation Move α, which took a particular element of a structure (a head or a phrase) and re-attached it to some other part of the grammatical tree. Movement, and the constraints on it, thus became a core focus of grammatical theorizing.
Perhaps the most widely studied instance of this hypothesized movement operation was the movement of wh-expressions (who, what, why, where, etc. and their analogues in other languages) in the formation of questions. In English, analogies between declarative sentences and their open-ended interrogative counterparts strongly suggested that these expression-types were related by just such an operation, moving the wh-expression from the position at which it receives semantic interpretation to a structurally higher position, such that it is pronounced at the beginning of the sentence. Thus, What does Zhangsan think Lisi bought? is analyzed as generated from an underlying structure in which the wh-expression is the direct object of the embedded verb, as in declaratives like Zhangsan thinks Lisi bought sunglasses, but has undergone movement to a sentence-initial position where it is pronounced. Speakers of English master these rules, and so we must attribute to them some underlying knowledge which incorporates operations of this sort.
This hypothesis immediately raised the question of whether this knowledge is a particular feature of languages like English, or whether it could be viewed as cross-linguistically universal, part of what all human children bring to the task of language acquisition. There are theoretical reasons to hope for the latter, both from a general preference for wider-ranging hypotheses and from learning-theoretic concerns about the acquisition of an operation which is at best indirectly evidenced in the child's experience. However, empirically, this looked implausible, on the basis of so-called 'wh-in-situ' languages, like Chinese and Japanese. In these languages, wh-expressions are found in exactly the same surface positions as their counterparts in declarative sentences. The Chinese equivalent of the question from the previous paragraph is Zhangsan yiwei Lisi mai-le shenme?, literally Zhangsan thinks Lisi bought what? (from Huang (1995)).
Despite this surface dissimilarity, it has been argued, and widely accepted, that the latter hypothesis, that wh-movement is indeed a universal feature of human languages, is true. The key move in defending this claim is to say that in wh-in-situ languages, while the question particle does not move overtly, i.e. in ways that have a reflex in the spoken sentence, it does move covertly. On such a view, the English and Mandarin sentences in question share the same underlying grammatical structure, with a question particle base-generated as complement to an embedded verb but subsequently raised to a sentence-initial position. They differ only in the ways that this shared structure is externalized, or mapped to the articulatory-perceptual systems.
The standard way to motivate such an initially counter-intuitive view is to draw analogies between languages. If some phenomena in English are best explained with reference to identified properties of movement, and we find the same phenomena in Mandarin, then explanatory parsimony favors positing movement in Mandarin. Huang, and many following him, have argued along exactly these lines.
Beginning with Ross (1967), generative syntacticians have identified a range of situations in which wh-movement is precluded, despite the questions that would be expressed by such structures (i.e. were they not so precluded) being semantically perfectly sensible. For example, relative clauses and sentential subjects create sentence-internal structures which generally seem to block any attempt at extraction of question particles. Take the sentence Nnamdi is the man who wrote the article. It is easy enough to imagine a situation in which a speaker knows that Nnamdi is the man in question who wrote something, but does not know what Nnamdi wrote. But the rules for question-formation in English preclude asking what this something is by using the structure that the typical case of question-formation would predict: *What is Nnamdi the man who wrote?. The explanations for such phenomena are stated in terms of constraints on movement. The details do not matter for our purposes, but these explanations appeal to the locality of movement: movement operations cannot simply move any expression anywhere in a syntactic structure, but are constrained to sequences of ''small'' movements, typically within a single inflectional phrase (IP) or determiner phrase (DP). The above attempt at question-formation would require movement of too great a distance within the structure, and is thus precluded, leading to the ungrammaticality noted above.
What Huang noted was that in Chinese languages, despite the fact that the question particles are found in their semantically-expected positions in the uttered sentence, many of the same constraints apply. Movement out of relative clauses, adverbial clauses, and sentential subjects is prohibited (see Huang (1982a, 1995) for examples and analysis), just as in languages like English in which the dislocation of wh-particles is apparent. The explanation proposed is that these typologically distinct languages actually share an underlying structure. Positing this shared structure thus allows us to explain data from different languages with the same hypothesis, unifying our explanations in the way discussed by philosophers of science such as Kitcher (1981).
Linguistic argumentation frequently involves this sort of reasoning, in which claims about the grammar of a particular language are motivated by appeal to grammatical hypotheses concerning other, unrelated, languages. This keeps explanatory posits to a minimum. It is quite unlikely that anyone would have posited wh-movement based purely on analysis of data from Chinese or Japanese. But the fact that wh-movement is explanatory of data in English shows that this is a possible operation for human languages, and thus can plausibly be appealed to in explaining surprising data in these other languages.
The problem for grammar induction approaches to language acquisition is to make sense of this fact. Induction of a grammar is a matter of identifying the best (e.g. simplest) way of capturing the data the learner has been exposed to, which is, invariably, data taken from the learner's own language. Facts about wh-movement in English can play no role in the process of a Chinese speaker inducing the grammar of Chinese. But if the above arguments are on the right track, evidence from English is relevant to theories about the grammar of Chinese. It would thus be somewhat miraculous if learners and linguists managed to identify the same underlying grammars, despite drawing their inferences from radically different data-sets. But it does not make sense for the linguists and the learners to end up in different places: what the linguists (are attempting to) describe just are what the learners learn. So, something has to give. Either theoretical linguists are systematically adopting methodologies which fail to get at their targets, being misled by the appeal of explanatory unification, or children are not learning their languages in the ways modeled by grammar induction models.
Of course, the analysis, and even the existence, of covert movement is controversial. See Bayer and Cheng (2017) for a survey. It might somewhat reduce the worries for a grammar induction model of language acquisition if it turned out that there was no covert movement, so that the relevant features of specific languages were closer to the surface and thus more suited to an inductive learner. But the crucial methodological point remains however this debate turns out (and, of course, covert movement is far from the only case in which sources of data outside those available to the learner are leveraged in theory-selection in linguistics). Linguists' arguments for one grammatical analysis over another depend on, and are confirmed by, the comparison of a range of data, in this case drawn from a range of different languages, to which an inductive learner has no access. Thus, it would be highly surprising if the inductive learner identified the same grammar as the linguists; to expect this amounts to the claim that these cross-linguistic analyses were strictly unnecessary for uncovering the correct grammatical theory.

If not induction, then what?
If the argument above is on the right track, the obvious question is: what, if not grammar induction, is involved in language acquisition? More precisely, what approaches to language acquisition are likely to converge on the grammars identified by theoretical linguists? I do not have a fleshed-out theory of language acquisition, and if I did, this would not be the place to expound it. But there are a few related directions in which I can point, which might be fruitful in developing an account better placed to show how children acquire the grammars that (generative) linguists say they do.
The first, most obvious and familiar, point is that if information that is not available to the language-learning child is nonetheless relevant to the shape of the adult grammar, then it must, somehow, be reflected in the child's innate endowment. If Chinese grammars are indeed reflective of phenomena that are only made evident in English, then the Chinese-learning child must have a way of accessing these facts without experience of English. This must thus be a feature not of learning in the traditional sense of extrapolation from observation, but of some other, non-data-driven form of development, such as innate constraints or more general features of cognitive development.
Of course, grammar induction models are not intrinsically opposed to the kinds of nativist assumptions suggested here. Indeed, it is commonplace to note that all models of acquisition of any sort are, and must be, 'innately' constrained in some way or other. The question is always: what must be innate in order to learn what is in fact learned? And I do not doubt that some forms of inductive inference are relevant to some aspects of language acquisition. The argument above can point the way to such a division of labour. Where the best hypothesis about the nature of the grammar contains features which are motivated by data not available to the language-learning child (e.g. data from other languages, from the unavailability of a given expression, from neuro/psycho-linguistic experimentation, etc.), such features are not products of induction. Other features of the target language, however, may be sufficiently evidenced in the data, and thus could be viewed as inductively learned. Surface word order (as distinct from underlying constituent structure), regular morphology, phonotactics, etc. seem like plausible examples of this latter category. Those aspects of developed grammars which are not evidenced in the training data, however, would need to be incorporated into the learning model itself, as constraints on the hypothesis space, priors and meta-priors, etc.
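Schematically, this division of labour can be pictured as a modification of the toy objective sketched earlier: innate constraints enter both as a hard filter on the hypothesis space and as a prior over the grammars that survive it. Both 'violates_universals' and 'prior_cost' below are hypothetical placeholders, and 'data_cost' is the fit term from the earlier sketch.

```python
def constrained_induce(candidates, corpus, violates_universals, prior_cost):
    """Induction inside an innately constrained hypothesis space.

    violates_universals: hard innate filter over candidate grammars.
    prior_cost: soft innate bias over admissible grammars (a prior,
        expressed in codelength terms).
    data_cost: the fit term from the earlier sketch; only this term
        depends on the child's primary linguistic data.
    """
    admissible = [g for g in candidates if not violates_universals(g)]
    return min(admissible,
               key=lambda g: prior_cost(g) + data_cost(g, corpus))
```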
Another, related, option can be drawn from the range of non-inductive learning processes, such as parameter-setting, particularly 'trigger'-based models. Triggering models of language acquisition differ drastically from grammar induction models, as they make no commitment to the claim that the parameter settings adopted by the child will be able to replicate or predict the training data from which the relevant triggers were drawn. In these kinds of models, children are not seeking to find the best way of systematizing their linguistic data, but are rather looking for a small number of crucial pieces of evidence indicative of some deep, and general, features of their language. If, in line with the nativist approach just described, the child takes for granted that there will be wh-movement in its language, all it needs to do is determine whether (and when) such movement is overt or covert.
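A minimal sketch of what such a learner might look like is given below, loosely in the spirit of Gibson and Wexler's Triggering Learning Algorithm; the binary parameter encoding and the 'parses' predicate are hypothetical placeholders. The contrast with the induction sketch above is that no global score over the accumulated data is ever computed: the learner reacts to individual sentences, one at a time.

```python
import random

def triggering_learner(input_stream, parses, n_params=2):
    """Greedy, single-value parameter-setting: when the current settings
    fail on an input sentence, flip one randomly chosen parameter, and
    keep the flip only if it repairs that particular failure."""
    settings = [random.choice([0, 1]) for _ in range(n_params)]
    for sentence in input_stream:
        if parses(sentence, settings):
            continue                       # no error, so no learning step
        trial = list(settings)
        i = random.randrange(n_params)     # single-value constraint
        trial[i] = 1 - trial[i]
        if parses(sentence, trial):        # greediness constraint
            settings = trial
    return settings
```

Note that nothing in this procedure requires the final settings to reproduce the input corpus; a single decisive trigger sentence can fix a parameter whose consequences reach far beyond anything attested in the data.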
Sakas et al. (2017) provide a relatively recent overview and defense of an approach along these lines. As they note, the parameter-setting view has ''fallen on hard times'' in recent years (p. 393). However, many of the criticisms of this view seem to assume that it is doing something very similar to induction models, i.e. accounting for all observed linguistic variation. Newmeyer (2017), for example, complains that parameters will need to be highly language-specific ('microparameters'), and thus that there will have to be implausibly many of them, undermining claims to have reduced the complexity of the innate component of human language enough to make such proposals better suited to evolutionary and developmental biology. This objection is reasonable if the goal of fixing parameters is to identify a linguistic system capable of generating all and only the utterances deemed acceptable or used by a given language user. But, as I have argued, this should not be the goal of developmental linguistic theory. Linguistic behavior is likely too confounded for a single psychological system to capture it accurately. If we instead view parametric theories as capturing only one determinant of this messy behavior, with much of the inter-linguistic and intra-speaker variability a product of extra-grammatical systems, then such theories become more plausible: they need capture only the variation actually reflective of underlying grammatical differences, without being cluttered by the need to account for all observed linguistic behavior.

These two points, and this article more generally, share the theoretical stance that language acquisition ought to be viewed as a quite disunified phenomenon. It is common in grammar induction models to assume that a single learning process results in the acquisition of all the knowledge, skills, and competence involved in producing linguistic behavior similar to that found in one's primary linguistic data. Piantadosi (2023) presents this view in extreme form, arguing that all aspects relevant to the production of language can be acquired by a single data-driven Deep Learning system which ''integrate[s] syntax and semantics in the underlying representations'' (p. 15). This approach thus conflicts with the standard assumption in theoretical linguistics (cf. Chomsky (1957/2002)) that each linguistic level (at least syntax, semantics, and phonology) operates within a dedicated system, over proprietary symbols.
If the above arguments are right, it is likely that the differential operation of these many different cognitive processes is essential to ensuring that the child identifies the genuine rules of their language, rather than merely acquiring the ability to reproduce data resembling the strings they have encountered. The grammar, in the sense of the rules governing the construction of hierarchical constituency structures, may be mostly innate, and thus not learned at all, or may be responsive to a limited amount of parametric variation. The mapping between grammatical structures and linearized utterances, however, may involve a substantial degree of induction, albeit within constraints provided by the grammar. And a whole host of other ''pieces'' of linguistic behavior, including conventional or unmarked forms of speech, idioms, loan-phrases, and various other constructions, are plausibly viewed as products of more general forms of knowledge acquisition, perhaps influenced to some degree by the core grammatical features of language, but not substantially constrained by them. This divide-and-conquer approach may shear off enough of the complexity in linguistic behavior that the underlying grammar can remain in line with that proposed by generative linguists.

Conclusion
Theoretical linguistics provides empirical and theoretical reasons for positing substantial distance between observable linguistic behavior and underlying grammatical structure. Inferences from the former to the latter are thus often informed by a range of data and assumptions which are not available to the language-learning child. This poses a deep problem for models of language learning as grammar induction, which view the language-learning child as attempting to identify the linguistic system which makes the best sense of the linguistic data they have encountered. If there were a plausible way to infer solely from primary linguistic data to a target grammar, why could linguists not simply make use of such an inference, rather than drawing on data from other languages, experimental research, judgements about unacceptable (and thus unuttered) sentences, coherence with other scientific theories, and much else? There is thus good reason to think that the grammar that would be induced by a grammar induction model will be quite different from the grammars posited by working linguists. It is a common-enough thought that it would be nice to dissociate the synchronic claims made by generative linguists about the structures made available by adult grammars from the developmental claims made about how these grammars are acquired, specifically with respect to claims about innate ''knowledge of language''. Grammar induction models may have been viewed as playing this role, showing how such grammatical systems can be genuinely learned, without changing the claims about what exactly it is that is learned. The arguments above suggest that this is an implausible strategy.
This, of course, is only an argument against grammar induction models if we assume that generative linguists are indeed identifying genuine properties of the world's languages. It is open to theorists to deny this, and to argue instead that the grammars that grammar induction models are likely to arrive at are the correct ones, and thus that human linguistic competence is much more closely related to the linguistic data than generative grammarians have traditionally assumed. For the reasons given above, this would likely lead to significantly more variation in underlying grammatical structure than is typically assumed by generative linguists, and would make linguistic explanations significantly more language-specific. In making this move, however, decades of research in linguistic theory are being rejected. It should be recognized that this generates an obligation to provide an alternative account of the wide range of data, from numerous world languages, and drawing on evidence not available in the learning environment, which is explained by these traditional linguistic theories. Piantadosi argues that Large Language Models, inductive learners par excellence, should be viewed as scientific theories. If this is correct, then they ought to be assessed as such along all the dimensions of theoretical virtue, including their explanatory/unifying power. In at least this respect, the ability of traditional linguistic theory to account for data from a range of different languages, as discussed in the case study above, seems to point heavily in its favor. Scientific progress consists in part in showing how a new approach can do things previous approaches could not, without discarding the reasons we found the prior approach plausible in the first place. For this reason, it would behoove future work on grammar induction models to take seriously the need to capture the explanatory gains made by theoretical linguists.