The key question in the debate on language evolution is why only humans have developed the amazing capacity to formulate and comprehend highly complex grammatical utterances, despite a very unreliable transmission medium (speech), rampant ungrammaticality, errors, and fragmented utterances. Several viewpoints are currently being researched. Here I take the “evolutionary linguistics” viewpoint, which argues that language is unique among animal communication systems because it evolves in a cultural rather than a genetic fashion, based on linguistic rather than natural selection (Steels, 2012).

It is helpful to look at the development of animal communication systems—more specifically, bird song—to understand more precisely the nature of this uniqueness. Bird song is a valuable model, partly because it is an animal communication system that has some of the properties of human language—in particular, the use of combinatory rules—and partly because it is very well studied.

Bird song has multiple functions: territorial demarcation, kin recognition, attracting a mate (by males), selecting a mate (by females), and location of neighbors. All of these functions are directly relevant for viability and fecundity, and hence impact fitness. There are three ways in which different bird species build and transmit a song system (Grant & Grant, 1996):

1. Many bird species have an entirely innate communication system, as is true for most animal communication systems. An example is the ringed turtledove (Streptopelia risoria), which produces loud repetitive sequences of “kuk-COORRRR-uk.” We find all the characteristics of a biological evolutionary system here. Genes code for the audition and production of specific song traits. Replication takes place through the inheritance of these genes, and hence of these traits. Variation is caused by gene combination, and errors in genetic copying give rise to song variations. Finally, natural selection integrates all the different factors mentioned earlier—namely territorial defense, sexual selection, and so forth—so that a certain type of song becomes dominant in the population.

2. Next, some bird species make use of learning. The role of biological evolution is now to generate neural structures that have the capacity to learn how to extract features of interest from the auditory signal and learn the sensory–motor control programs that reproduce songs. Thus, the songs themselves are no longer genetically coded, and birds are flexible about which songs they can recognize and produce. Auditory song learning is based on unsupervised statistical learning, but is biased by a species-specific auditory filter. Motor control learning is based on template learning, in the sense that the juvenile stores the adult song and then rehearses a song until it approaches the stored song. The relevant learning mechanisms are reasonably well understood from both a computational point of view, through computer simulations and theoretical analysis, and a biological point of view, through lesion studies, neuroimaging, and gene expression (Fiete, Fee, & Seung, 2007). In many species, the juvenile learns only one song from the father and maintains that song all its life, in which case there is not that much difference from genetic transmission, particularly because selection is still based on fitness.

3. However, as soon as bird songs become transmitted by learning, there is the potential for cultural evolution. And indeed, some bird species exhibit significant and quite rapid change in the song repertoire (Katahira, Suzuki, Kagawa, & Okanoya, 2013). Cultural evolution occurs particularly in open-ended learners, such as the canary (Serinus canaria), among which male birds increase and modify their repertoire throughout life. This is cultural evolution, because replicator dynamics—the core characteristic of an evolutionary system—become operational at the cultural level. The replicating units are the characteristics of songs or song fragments (syllables), or to be more precise, the motor control programs that produce them, as well as the auditory sensory feature detection and pattern recognition used by the song recognition system. Replication takes place by means of social learning. The juvenile male bird acquires the song of his father, or in some cases of a neighbor or other male. Learning preserves the characteristics of the adult song, so there is multiplication with inheritance, but the process is not always exact, and in cases like the canaries, male birds keep introducing variations of their own as adults that may propagate to others. What about selection?

On the one hand, a neutral form of change is clearly going on. The song changes, but this is due to random, neutral substitution and drift (Belzner, Voigt, Catchpole, & Leitner, 2009). In these cases, we cannot really talk about evolution, but rather about simple change. On the other hand, several researchers have put forward possible selectionist effects of bird song change (Williams, Levin, Norris, Newman, & Wheelwright, 2013): (i) Song differentiation leads to better recognition of the individual, the locality, and the social group, which all play roles in mate choice. For example, better kin recognition helps to avoid inbreeding. (ii) Features of the song are used by females to assess the male, occasionally leading to an arms race due to sexual selection. (iii) Songs may culturally adapt to changes in the physiology of the vocal apparatus (in particular, the beak) or to changes in the environment (e.g., low-frequency sounds with slow repetitions carry better in forested habitats). Note, however, that these selection criteria are all related to natural selection. Songs propagate and their characteristic features become more widespread because they help give an adaptive advantage, which increases the frequency of the subpopulation that sings them. The advantage of cultural evolution in this case is a higher adaptation rate than is found with biological evolution, which thus supports more rapid speciation, so that more ecological niches get explored (adaptive radiation).
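The contrast between neutral change and selection-driven change can be made concrete with a minimal sketch. It assumes a fixed-size population with two song types, in which every juvenile copies the song of one tutor from the previous generation (a Wright-Fisher-style scheme, chosen here for simplicity and not taken from the studies cited above). The names `next_generation`, `simulate`, and `advantage`, and all parameter values, are hypothetical.

```python
import random


def next_generation(songs, advantage, rng):
    """Each juvenile copies one tutor drawn from the previous generation.

    `advantage` maps a song type to the relative number of pupils its
    singers leave behind (an assumed stand-in for reproductive success).
    Equal values for all song types give purely neutral drift.
    """
    weights = [advantage[song] for song in songs]
    return rng.choices(songs, weights=weights, k=len(songs))


def simulate(advantage, pop_size=100, generations=200, seed=0):
    """Return the share of song type "A" after each generation."""
    rng = random.Random(seed)
    half = pop_size // 2
    songs = ["A"] * half + ["B"] * (pop_size - half)
    shares = []
    for _ in range(generations):
        songs = next_generation(songs, advantage, rng)
        shares.append(songs.count("A") / pop_size)
    return shares


# Neutral case: both song types are copied equally often.
neutral = simulate({"A": 1.0, "B": 1.0})

# Selection case: singers of "A" leave twice as many pupils (assumed value).
selected = simulate({"A": 2.0, "B": 1.0})
```

With equal weights, the share of "A" wanders and can fix on either type by chance alone. With an advantage, "A" tends to take over, but note that the advantage modeled here is still the reproductive success of the singers, which is exactly the sense in which selection on bird song remains natural selection.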

What about human language? Language is clearly also a culturally evolving system. Even a casual look at historical data shows that the phonetic/phonological system (which concerns speech), the conceptual system (which concerns meaning), and the lexico-grammatical system (which concerns the relation between speech forms and meanings) all change rapidly, and often in profound ways. The historical language record shows that the expression of a particular meaning domain—for example, time (tense–aspect), semantic relations (cases), or information structure (foreground/background)—goes through different modes: (i) Meanings may at first have to be inferred from the context or from domain knowledge. (ii) They may then become expressed explicitly by the introduction of a new word or the expansion of an existing word meaning. (iii) Next, different words may get organized in phrasal structures to express more meanings, including the semantic relations between word meanings. Some of the words incorporated in these phrases grammaticalize, in the sense that they lose some of their original source meaning (a process called semantic bleaching), some of their combinatorial capacity, and some of their form (a process called erosion) (Traugott & Heine, 1991). For example, the word will, which originally was a full verb meaning “to want,” became an auxiliary indicating future tense, and its form in spoken language has been shrinking to ’ll. (iv) Recurrent phrases tend to become routinized patterns that coalesce into complex word forms (Haspelmath, 2011), with inflectional systems and grammatical agreement replacing the phrase structure. (v) And finally, independent phonological change further erodes critical grammatical endings or makes words ambiguous, so that the explicit clear expression of their meaning gets lost. And then the cycle starts again. 
These changes did not happen only in the remote past. They mark an ongoing process that fired up at the very beginning of grammatical language and is still active today.

Given that both birdsong and human language undergo cultural evolution, we should ask where the differences lie. After all, the inventory of animal communication systems tends to be extremely small, relative to the 40,000 words and tens of thousands of grammatical constructions that human languages employ. It is true that we also observe drift in human language evolution: certain sounds may shift, the meaning of a word may shift, an existing word may take on new meanings, the ordering of constituents in a phrasal pattern may get reshuffled, and so forth, all for no apparent reason other than catching attention through novelty, fashion, prestige, performance deviations and errors, or language contact. But linguistic change is often not neutral, and can only be understood in terms of linguistic selection. Certain variants spread and get reused because they contribute to more adequate expressive power, improved communicative success, or a minimization of cognitive effort. Moreover, the source of variation is often not random, but is canalized by task demands, constraints on the human sensory–motor system, and the available potential of the cognitive system. Variation thus is also directed by human intelligence, which expands or molds language to fill gaps in expression or avoid excessive effort such as combinatorial search. Let us look at two examples.

The first is from the speech domain. The Latin word vita (“life”) evolved into French vie, with the ending pronounced like the English “e” in me. What were the causal factors behind this change? /i/ and /a/ are produced through vibrations of the vocal cords and different horizontal and vertical positions of the tongue: /i/ requires a more closed, frontal position, and /a/ an open near-back position. /t/ is produced by the tip of the tongue against the palate, cutting off airflow. When producing vita, the speaker has to move the tongue forward for the /i/, up to the palate for the /t/, and then backward for the /a/. In the evolution from colloquial Latin to (modern) French, these movements became optimized. First, vita changed to vida, so that the tongue was already a bit more backward, in the direction of /a/; then vida became vitha (with “th” pronounced as in English the), so that the tongue approached even more the position required for /a/; and then vitha became via, with the extra consonantal movement disappearing altogether. Meanwhile, the /a/ eroded to the schwa, pronounced like the “e” in English the, and then disappeared as well. Clearly, speakers minimize effort by articulating less strictly some of the consonants and vowels, particularly in rapid speech. Variants of word forms are not random, but are canalized by the nature of the articulatory apparatus and the demands of motor control. A particular variant becomes the new norm depending on whether it indeed minimizes effort without compromising auditory recognition, or alternatively, if it disambiguates a word form that has become homonymous through phonetic erosion and optimization.

The second example is from syntax. There is a well-known cross-linguistic development toward the formation and steady expansion of noun phrases. For example, adjectives, which could be free-floating in the sentence in the precursors of Latin, because their relation to the noun was expressed clearly enough using grammatical agreement, were increasingly put strictly adjacent to the noun, creating the first additional slot in a noun phrase. Progressively, other slots appeared, for an article, quantifier, additional adjectives, modifiers, and so forth. So we see in English a growing complexity, as in “slot” > “particular slot” > “a particular slot” > “their own particular slot” > “their own two particular slots” > “their own two particular highly desired slots” > “only their own two particular highly desired slots” > “what is believed to be only their own two particular highly desired slots” (Van de Velde, 2011).

How can we explain this? Again, this is clearly not a case of random variation. When semantically related words are grouped, parsing becomes more localized, and less information needs to be kept in short-term memory. Likewise, when words are placed in slots at particular positions, a pattern becomes more predictable. It becomes a routine, efficient decision for the speaker where to put the word in a sentence, and it becomes more straightforward for the listener to recognize what role the word is semantically playing. Of course, I do not argue here that speakers make conscious language design decisions. Instead, they introduce variation through operations like grouping, word reordering, phonetic optimization, hierarchical structuring, streamlining through analogy, differentiation, pattern generalization, and so forth, building further on what already exists. Those variants that contribute to a “better” communication system then become dominant. This is indeed the most important point: Selection is not natural selection, in the sense of being based on survival and fecundity, but linguistic selection, leading to increased communicative success, adequate expressive power, and reduced cognitive effort. There is also no optimal solution—for example, a simplification for the speaker can mean a complication for the listener. Hence, a language keeps moving around in its space of possible variants, occasionally undergoing rapid change between periods of stasis (Dixon, 1997).
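The point above about localized parsing can be illustrated with a toy measure: the summed distance between each word and the word it depends on, used here as a rough proxy for how much a listener must hold in short-term memory. The measure, the five-word sentence skeleton, and the names `total_dependency_length`, `scattered`, and `grouped` are assumptions made for this sketch and are not drawn from the cited work.

```python
def total_dependency_length(order, dependencies):
    """Sum the distances, in word positions, between dependents and heads.

    `order` is the list of words as spoken; `dependencies` is a list of
    (dependent, head) pairs. Smaller totals mean more local parsing.
    """
    position = {word: index for index, word in enumerate(order)}
    return sum(abs(position[dep] - position[head]) for dep, head in dependencies)


# Hypothetical skeleton: two adjectives, each modifying its own noun,
# and both nouns depending on the verb.
dependencies = [
    ("ADJ1", "NOUN1"),
    ("ADJ2", "NOUN2"),
    ("NOUN1", "VERB"),
    ("NOUN2", "VERB"),
]

# Free-floating adjectives, linked to their nouns only by agreement.
scattered = ["ADJ1", "ADJ2", "VERB", "NOUN1", "NOUN2"]

# Adjectives placed strictly adjacent to their nouns (noun-phrase slots).
grouped = ["ADJ1", "NOUN1", "VERB", "ADJ2", "NOUN2"]

scattered_cost = total_dependency_length(scattered, dependencies)
grouped_cost = total_dependency_length(grouped, dependencies)
```

In this skeleton the grouped order gives a total of 5 against 9 for the scattered order, which is the kind of processing gain that could make the grouped variant spread. A fuller model would also have to weigh the speaker's effort, since, as noted above, a gain for one side can be a cost for the other.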

The goal of evolutionary linguistics is not to collect more data; we already have plenty of data, gathered carefully by historical linguists. Rather, the goal is to understand the selectionist factors and cognitive mechanisms at work in language evolution. This will require a fresh look at the mechanisms needed for the production, comprehension, and learning of language (Steels & Szathmáry, 2016), as well as investigations into language dynamics: What are the selectionist criteria, and how do they influence learning and performance? What are the dynamics underlying the competition between linguistic variants, and how are they enacted? What is the effect of population change or intermixing (language contact)? The evolutionary linguistics approach is not only grounded in empirical data from historical linguistics. It also gains helpful data and insights from language typology, which has exposed the enormous variation in human languages; from sociolinguistics, which documents language “on the move”; from psycholinguistics, which seeks to understand language processing; and from computational linguistics, which develops concrete operational models of the processing that can sustain language as a complex adaptive system (Steels, 2012). There is still a lot we do not understand about language, but the evolutionary perspective can act as a glue, bringing disparate research under one umbrella, in the same way that evolutionary biology has unified and refocused all of biology.