Cultural evolution leads to vocal iconicity in an experimental iterated learning task

Experimental and cross-linguistic studies have shown that vocal iconicity is prevalent in words that carry meanings related to SIZE and SHAPE. Although these studies demonstrate the importance of vocal iconicity and reveal the cognitive biases underpinning it, there is less work demonstrating how these biases lead to the evolution of a sound-symbolic lexicon in the first place. In this study, we show how words can be shaped by cognitive biases through cultural evolution. Using a simple experimental setup resembling the game of telephone, we examined how a single word form changed as it was passed from one participant to the next by a process of immediate iterated learning. About 1,500 naïve participants were recruited online and divided into five condition groups. The participants in the CONTROL-group received no information about the meaning of the word they were about to hear, while the participants in the remaining four groups were informed that the word meant either BIG or SMALL (with the meaning being presented in text), or ROUND or POINTY (with the meaning being presented as a picture). The first participant in a transmission chain was presented with a phonetically diverse word and asked to repeat it. Thereafter, the recording of the repeated word was played for the next participant in the same chain. The sounds of the audio recordings were then transcribed and categorized according to six binary sound parameters. Modelling the proportion of vowels or consonants for each sound parameter showed increases of FRONT UNROUNDED vowels in the SMALL-condition and of ACUTE consonants in the POINTY-condition. The results show that linguistic transmission is sufficient for vocal iconicity to emerge, which demonstrates the role non-arbitrary associations play in the evolution of language.


Introduction
Languages have iconic structure, that is, some form of resemblance between sound and meaning, woven into the very core of the lexicon (Dingemanse et al. 2015; Blasi et al. 2016; Erben Johansson et al. 2020; Joo 2020). But how does such patterning enter languages, and what explains its apparent universality? In this article, we use the experimental iterated learning paradigm to show how the cultural transmission of a single artificial word may converge on iconic sound-meaning mappings.

Oppositional vocal iconicity
The number of studies on the genetically and areally independent, (near-)universal, non-arbitrary, and flexible associations between sounds and meanings has grown considerably in recent decades. This type of association is generally referred to as vocal iconicity or motivated sound symbolism (Cuskley and Kirby 2013). Several large cross-linguistic studies (Wichmann et al. 2010; Blasi et al. 2016; Erben Johansson et al. 2020), which in some cases incorporate data from thousands of languages, have identified a number of robust overrepresentations of sounds across languages in basic vocabulary items for concepts that are supposed to be more or less universal to all speakers of all languages (e.g. tree, you, mother, eat, black, small), both culturally and historically (Swadesh 1971; Goddard and Wierzbicka 2002). Collectively, these studies found iconic effects for a wide range of meanings consisting mostly of several fundamental nouns (e.g. ASHES, BREASTS, NOSE, and TONGUE) and verbs (e.g. TO BLOW, TO BITE, and TO SNEEZE), but also a few pronouns (I, WE, and YOU) and adjectives (RED, ROUND, and SMALL).
Experimental evidence has also covered a variety of meanings. For example, Maglio et al. (2014) found that front vowels, as opposed to back vowels, tend to be linked to conceptual precision in fictional city names. Fast speed has also been linked to front vowels and slow speed to back vowels in non-words when asking participants to describe the motion of a ball (Cuskley 2013). Anikin and Johansson (2018), Hamilton-Fletcher et al. (2018), and Johansson et al. (2019) found a similar effect for tastes: participants tend to assign lower F 1 and F 2 frequencies to salty taste samples and higher F 1 and F 2 frequencies to sweet taste samples. Additionally, there is a large body of studies, both experimental and cross-linguistic, on associations between different color parameters, such as lightness, saturation, and hue, and acoustic parameters, such as pitch, energy spectrum, vowel formants, and loudness, in great apes, infants, toddlers, adults, synesthetes, non-synesthetes, etc. (Anikin and Johansson 2018; Hamilton-Fletcher et al. 2018; Johansson et al. 2019).
Evidently, vocal iconicity seems to be prevalent in the core of the lexicon; however, with the exception of color, most experimental studies on vocal iconicity have focused on the SIZE- and SHAPE-dimensions, with the sound side usually conveyed through made-up non-words. Almost one hundred years ago, Sapir (1929) conducted a study on size-based vocal iconicity which showed that 80% of almost 500 participants preferred to associate a small table with the phonetic form /mil/ and a large table with the form /mal/. Similarly, Köhler (1929) investigated shape-based vocal iconicity by asking participants to match a round, amoeba-like shape and a pointy, star-like shape with either /takete/ or /baluma/ (later replaced by /maluma/ in his 1947 study). Most of the participants thought that the best fit for the round shape was the word containing voiced sounds, and the pointy shape was accordingly paired with the word containing unvoiced sounds. Köhler's (1929) work was later built on by several scholars (e.g. Rogers and Ross 1975; O'Boyle and Tarte 1980; Lindauer 1990; Holland and Wertheimer 2016; Bross 2018; for a review, see Lockwood and Dingemanse 2015), and associations between round shapes and phonetic forms such as /maluma/ or /bouba/, and between pointy shapes and phonetic forms such as /takete/ or /kiki/, have since been found to hold for around 90% of participants with a wide range of first languages (Styles and Gawne 2017).
Similar studies on words from natural languages have demonstrated iconic effects in a wide range of semantically opposite meanings. Newman (1933) found a correspondence between the articulation and acoustics of vowels and those vowels' perceived size and brightness: vowels pronounced at the back of the mouth had a lower acoustic frequency and were also judged to be larger and darker. Johnson (1967) showed that when participants were tasked with coming up with English words to denote small and large size, the vowel quality in the words correlated with the words' meanings, ranging from smallest /i/, followed by /e/, /a/ and /u/, to largest /o/. This study was later expanded to include Mandarin Chinese and Thai, which also yielded similar results (Huang et al. 1969). Fónagy (1963) compared /i/ and /u/ in Hungarian and concluded that /i/ was considered quicker, smaller, prettier, friendlier, and harder than /u/, while /u/ was perceived as thicker, hollower, darker, sadder, blunter, more bitter, and stronger than /i/ (in both children and adults). Taylor and Taylor (1962) and Taylor (1963) found iconic effects for big-small, active-passive, warm-cold, and pleasant-unpleasant in four unrelated languages, and Gebels (1969) found that speakers of five different languages could correctly match the meaning of sensory words from the other languages above chance level.
Perhaps the most widely known type of vocal iconicity is onomatopoeia (i.e. human imitations of real-world sounds with varying similarity to the source sound), which has been referred to as imagic, absolute, or imitative iconicity (Hinton et al. 1994; Dingemanse 2011; Carling and Johansson 2014; D'Onofrio 2014; Dingemanse et al. 2015). For example, the English word cuckoo is a direct imitation of the calls produced by the cuckoo, but produced through the filter of the human vocal apparatus and linguistic sound system. However, in contrast to onomatopoeia, the type of vocal iconicity usually investigated experimentally involves referents that are based on senses other than hearing, such as size, shape, deixis, or color, and can in most cases be classified as relative or word-relational diagrammatic iconicity. Relative iconicity is constructed by mapping semantic contrasts to phonetic contrasts which are somehow similar to each other. This usually includes binary semantic meanings that can easily be placed in opposition to each other (FAST-SLOW, BIG-SMALL, ROUND-POINTY, etc.) and phonetic attributes that can be perceived to belong to a gradable scale (e.g. voicing, quality, quantity, tone, volume, etc.). For example, if SMALL is mapped to front unrounded vowels and BIG is mapped to back rounded vowels, these parallel sound-meaning associations add relations between the semantic and phonetic parameters to the internal relations within the semantic parameter SIZE (between BIG and SMALL) and the phonetic parameter roundedness (between unrounded and rounded vowels). Ohala (1994) argues that the so-called frequency code (see also Rendall et al. 2005) could be the underlying mechanism responsible for associations of this type. It states that since the size of the resonance chamber of an animal dictates the fundamental frequency of that animal's vocalizations, the sounds that the animal produces can be utilized in various ways to evoke properties such as size.
This works according to the same principle as erecting feathers or fur in threatening situations to seem larger, or cowering when wanting to submit. Ohala therefore argues that most animals, and maybe humans, perceive low and/or falling fundamental frequencies of vocalizations, such as growling, as large, authoritative, confident, dominant, or distant, and high and/or rising fundamental frequencies of vocalizations, such as whining, as small, polite, questioning, dependent, or near.

Confounds in vocal iconicity experiments
Based on these studies, sounds associated with meanings belonging to the SIZE- and SHAPE-domains, which are the focus of the current study, can be summed up into a few general groups. The meaning SMALL has been reliably associated with voiceless consonants and vowels with a low first formant or high second formant (e.g. [i] or [a]), while the meaning BIG has been associated with voiced consonants and vowels with a high first formant or low second formant (e.g. [u] or [ɑ]). Associations with round shapes might primarily be attributed to the roundedness of the vowel, but it is possible that other parameters, e.g. vowel height and backness, play a role as well. Furthermore, despite the large number of studies that have found supporting evidence for the bouba-kiki effect, there are two reported cases where the effect has failed (Rogers and Ross 1975; Styles and Gawne 2017). Both of these were, however, conducted with participants speaking languages in which the stimuli words were potentially not phonologically possible. This could have led to issues with parsing the linguistic strings, which, in turn, could also result in a breakdown in the mapping between sounds and meanings. This raises questions about the strength of the bouba-kiki effect as well as the influence of language-specific phonological makeup. Thus, while binary distinctions between sounds ought to be useful for studying binary semantic pairs, they should be broad enough to accommodate speakers with different first languages, as well as capture individual parameters which are also articulatorily, acoustically, and iconically relevant. Vowels can be primarily divided according to height, backness, and roundedness, which also loosely correspond to the first three formants, F 1 , F 2 , and F 3 , and thus cover most of the variation used for distinguishing vowel segments across languages (Ladefoged 2001: 32-36).
Furthermore, energy level differences in F 1 and F 2 have been iconically linked to size, distance, dominance, etc., while the roundedness reflected in F 3 has been linked to shape. Consonants are considerably more articulatorily diverse than vowels, but can generally be divided according to voicing, manner of articulation, and position of articulation. The distinction between voiceless and voiced sounds is self-explanatory, cuts through all consonants, and is used phonemically in most languages (Ladefoged 2001: 63-65; Ladefoged and Maddieson 1996: 44-46). In addition, it is, like F 1 and F 2 , iconically associated with a number of meanings, such as size (Ohala 1994). Manner of articulation includes a wide variety of sound types, but one of the most fundamental ways of classifying consonants is to distinguish between sonorants and obstruents. This distinction has also been employed in several iconicity experiments because the contrast between sonorants' continuous, non-turbulent airflow and obstruents' obstructed airflow could iconically evoke, for example, noisiness versus smoothness or other related meanings. Position of articulation also includes a range of different sounds which can be difficult to fit in a binary distinction without basing the distinction on a specific marked feature and grouping the remaining features in a contrastive group. However, the distinction between grave and acute sounds (Jakobson et al. 1953) differentiates between perceptually sharper and duller sounds and has also been linked to iconic associations (Lapolla 1994). Grave sounds include consonants produced by using soft tissue secondary articulators, notably the lips and the area from the soft palate and back, while acute sounds include consonants produced using the hard palate as a secondary articulator.
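The grave/acute split described above amounts to a binary lookup over consonant segments. The following toy classifier sketches that idea in Python; the membership sets are simplified illustrations based on Jakobson et al.'s peripheral/coronal distinction, not the coding scheme actually used in this study:

```python
# Toy grave/acute classifier. The segment sets below are simplified
# assumptions (grave = labial and velar/uvular consonants; acute =
# dental/alveolar and palatal consonants), not the study's own coding.
GRAVE = set("pbmfvwkgŋq")
ACUTE = set("tdnszlrʃʒɲj")

def grave_or_acute(consonant: str) -> str:
    """Classify a single IPA consonant as 'grave', 'acute', or 'other'."""
    if consonant in GRAVE:
        return "grave"
    if consonant in ACUTE:
        return "acute"
    return "other"
```

A segment falling outside both sets (e.g. glottal [h]) is left unclassified rather than forced into either pole.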
Related to the binary division of sounds, the inherent structure of relative diagrammatic iconicity, in which two poles of phonetic and semantic parameters are mapped in parallel, has led researchers to create stimuli consisting of premade non-words. However, this methodological setup does not explore which sounds are actually relevant for identifying iconic mappings and might in some cases yield incorrect information about the strength of these iconic effects. Nielsen and Rendall (2012) conducted a learning experiment in which one group of participants was taught to combine non-words with iconically congruent figures and one group was taught to combine non-words with iconically incongruent figures, after which all participants were subjected to random single word-figure combinations and asked to judge the combinations as "correct" or not. The results revealed that while the participants in the incongruent group performed at chance level, the congruent group performed only modestly above chance (53.3% correct). This therefore suggests that the iconic bias might be weaker than demonstrated by previous studies and that the forced choice paradigm could inflate weak effects (Dingemanse et al. 2015).
Furthermore, premade non-words contain both vowels and consonants, but studies have shown that the effects of vowels and consonants might differ in iconic strength. Ahlner and Zlatev (2010) investigated the bouba-kiki effect by selecting vowels and consonants that had been reported to contrast iconically and then created four sets of non-word types. Two of these types were iconically congruent (e.g. [titi] for POINTY and [mumu] for ROUND), while the other word types were iconically incongruent (e.g. [tutu] for POINTY and [mimi] for ROUND). Participants were then asked to match these words to a pointy or round shape, virtually identical to those used by Ramachandran and Hubbard (2001). The results showed a clear preference for the iconically congruent words, and this effect was stronger in the words with congruent consonants, which might indicate that consonants play a more important role in this iconic mapping. Similarly, Nielsen and Rendall (2013) demonstrated a stronger preference for plosives and unrounded vowels for pointy figures, as opposed to sonorants and rounded vowels for round figures.
Finally, another issue in this type of experiment is orthography, since the vast majority of tested participants are from literate societies. This could result in the shapes of letters in text stimuli having an effect on responses. For example, Nielsen and Rendall (2011) investigated the bouba-kiki effect using stimuli non-words conveyed through text and found iconic effects for consonants but not vowels. The experiment was then rerun using auditory stimuli through a text-to-speech synthesizer to exclude any orthographic bias, and they found the same general pattern regarding consonants and vowels, although the overall effect was weaker than with the text stimuli. Furthermore, Cuskley et al. (2017) found that orthography seems to be a major confounding factor for associations between sounds and shapes (e.g. for the voiced/voiceless distinction), which they argue had not been sufficiently controlled for in most previous studies. By testing both how well literate participants matched abstract shapes to non-words in written form along with spoken representations, and how well they matched the shapes to purely auditory non-words, Cuskley et al. showed that the curvature of letters can significantly influence the perceived roundedness of shapes in sound-shape associations.
However, Hamilton-Fletcher et al. (2018) showed that these types of correspondences might be more complex. While pitch-shape correspondences required visual experience to emerge in blind participants, pitch-size and pitch-weight were found to be unaffected by visual experience, and pitch-texture and pitch-softness even seemed to emerge or grow stronger with blindness. Thus, visual experience cannot solely explain why people with limited multisensory interactions have multimodal perception. Instead, this could be attributed to other factors such as neuroplasticity.

Vocal iconicity through iterated learning
Some general conclusions can be drawn from the different approaches that the studies we have reviewed have employed. In the bouba-kiki effect, both vowels and consonants seem to play a role, which illustrates the value of thoroughly investigating how different sounds are mapped to different meanings. Furthermore, previous studies, with some notable exceptions (e.g. Jones et al. 2014; Tamariz et al. 2017, described below), have typically relied on experimental paradigms in which participants are asked to associate meanings with a predefined set of non-words or syllables. This means that while the bouba-kiki effect seems to be more or less universal, it is also subjective in nature, given that each individual participant is asked to combine meanings with sounds that may or may not adequately fit his or her intuition or phonology. We therefore wanted to investigate the cognitive biases that lie at the core of vocal iconicity by using a methodological approach that focuses on the transmission of vocal iconicity through the language filters of participants with a wide range of native languages, but which also excludes orthographic influence as much as possible. This approach would then allow us to get a more holistic picture of the bouba-kiki effect by revealing differences from the results of previous studies.
One way of achieving this is to use methods that are designed to study how systems, for example languages, change over time, such as the iterated learning paradigm (Kirby 2001; Kirby and Hurford 2002; Kirby et al. 2008, 2015). In iterated learning studies, some form of information, such as words, music, or drawings, is transmitted from one participant to another, with the learner at generation i producing behavior that is input to the learner at generation i + 1. Together, several generations of such learners form a 'transmission chain'. At its core, the iterated learning paradigm relies on the fact that information tends to be lost during the transmission process (Spike et al. 2017), causing the object of study to change in ways that reflect the learner's cognitive biases, whatever those biases happen to be, and the dynamics involved in the particular transmission channel used. For example, Canini et al. (2014) have shown how category learning biases can emerge naturally through an iterated learning study. In this way, iterated learning experiments can be used as a technique to uncover the cognitive biases of participants.
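The way a weak per-generation bias can accumulate over a transmission chain can be illustrated with a minimal simulation. In the Python sketch below (all parameters are invented for illustration, not taken from any study), each segment of a toy two-vowel "word" survives transmission intact with some probability and is otherwise resampled with a slight bias toward /i/; over 15 generations the chain drifts toward the biased variant even though each individual transmission step changes little:

```python
import random

def transmit(word, noise=0.2, bias=0.1):
    # Each segment survives transmission intact with probability 1 - noise;
    # otherwise it is resampled with a slight bias toward 'i'.
    out = []
    for seg in word:
        if random.random() < noise:
            out.append("i" if random.random() < 0.5 + bias else "u")
        else:
            out.append(seg)
    return "".join(out)

def run_chain(seed="iuiuiuiu", generations=15):
    # One transmission chain: the output of generation g is the
    # input to generation g + 1.
    history = [seed]
    for _ in range(generations):
        history.append(transmit(history[-1]))
    return history
```

Averaged over many simulated chains, the proportion of /i/ in the final generation settles near the biased resampling probability (0.6 here), even though every chain starts from a balanced seed; this is the bias-amplification property that iterated learning experiments exploit.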
However, to date, only a few studies have investigated the emergence of vocal iconicity through iterated learning. Jones et al. (2014) trained participants on miniature languages that consisted of pairings between various round and pointy shapes and written labels which were rated as sounding iconically neutral by English monolinguals. The participants then had to type the label learnt for each shape, including shapes they had not previously been trained on, and these labels were passed on to the next participant. Jones et al. found that iconic labels emerged to express round shapes but not pointy ones. When the participants then had to match labels that were judged as either iconically round, pointy or neutral to one of two shapes, they again only found an effect for the round shapes, which therefore suggested that the driving force behind this type of iconic mapping is the lip shape involved when producing round sounds rather than a cross-modal diagrammatic mapping. Tamariz et al. (2017) conducted a similar study in which participants were assigned to one of two conditions. The first condition was a standard iterated learning design, as described above: participants had to learn the mapping between written non-words and meanings (spiky and round figures) and this mapping was then taught to a new participant, and so forth. In the second condition, there were two participants in each generation who used the words to communicate with each other. The authors found that the emergent words were rated as more pointy under the communicative condition, suggesting that the process of communicating with others contributes to stronger iconicity effects. Carr et al. (2017) also found that iconic patterning can emerge through iterated learning. In their experiments, participants had to learn words (presented in both written and auditory form) for randomly generated triangles. 
Although the study was not designed to investigate vocal iconicity directly, the authors nevertheless noted that thinner triangles tended to be labelled by sounds listed as 'pointy' by Ahlner and Zlatev (2010) (e.g. /k/, /i/, /t/), while more equilateral triangles tended to be labelled using sounds listed as 'round' (e.g. /b/, /m/, /u/). They found this effect under both a standard iterated learning design and a design in which participants had to communicate. Furthermore, Edmiston et al. (2018) showed that when environmental sounds, such as breaking glass or splashing water, are imitated through iterated learning, they become more stable and word-like, resembling ideophones. The final forms of the imitations could be matched to the source sounds above chance. Likewise, when people are asked to make up novel vocalizations for basic vocabulary words, naïve listeners are able to infer what they mean based on their phonetic forms (Parise and Pavani 2011; Perlman et al. 2015; Perlman and Lupyan 2018).
Based on previous studies, we know that the meaning pairs SMALL-BIG and POINTY-ROUND have been found to be consistently associated with sounds. These studies have also shown that a binary distinction between different types of sounds seems to be beneficial for studying oppositional vocal iconicity. We do not, however, know exactly which phonetic parameters are relevant for understanding iconicity in the SIZE- and SHAPE-domains, including broad categories such as vowels and consonants, since the premade stimuli words previously used have generally included a mix of sounds which belong to several different parameters. Hence, in order to understand how iconic associations emerge and which sounds and meanings are driving these effects, this study adopts a new approach for studying these phenomena using immediate iterated learning, which also bypasses the forced choice paradigm. Thus, this study adopts an explorative approach with the aim of revealing iconic correspondences. However, based on evidence from the large body of previous studies on sound-meaning associations in the SHAPE- and SIZE-domains, we could make some general assumptions: (1) the meanings SMALL and POINTY could result in words with a larger share of high or front vowels and of consonants with high-frequency energy accumulation than the meanings BIG and ROUND, and (2) the meaning ROUND could result in words with a larger share of rounded vowels and labial consonants than the meaning POINTY.

Method
The methodological setup we used is relatively simple. The participants were divided into five conditions (CONTROL, BIG, SMALL, ROUND, and POINTY), were presented with a recording of a single seed word, which included a wide range of typologically common segments, and were asked to repeat it. The repetitions uttered by the participants were recorded and then used as stimuli for the next participant in the same transmission chain. This process was then repeated for 15 generations per transmission chain. In the CONTROL-condition, the word was simply passed down the 15 generations, but in the other conditions the participants were primed with a meaning. The overall paradigm is illustrated in Fig. 1.

Participants
Participants were recruited online via the Figure Eight crowdsourcing platform, which made it possible to include participants from several countries and with a range of different first languages. The participants were prevented from participating in the experiment more than once by identifying themselves with their unique worker IDs. The aim of the study was to include 15 generations (participants) per transmission chain and 20 transmission chains for each of the five conditions, for a total of 1,500 unique participants. To achieve this, we recruited 2,854 participants, 1,354 of whom had to be excluded for one or more of the following reasons: (1) misunderstanding the task, such as repeating the meaning ('big', 'small') rather than the word, or asking a question about the task; (2) providing recordings of low quality (e.g. lack of sound, interfering background noise, or recordings in which there were no recognizable sounds from the previous generation); or (3) providing recordings with obvious lexical interference, such as mistaking the presented audio for a word or phrase in a real language. The CONTROL-condition required 554 participants to yield 300 usable recordings, the BIG-condition required 592, the SMALL-condition required 591, the ROUND-condition required 565, and the POINTY-condition required 552. The participants were paid 50 US cents for completing the task, which took around two minutes, and the study was conducted under established ethical standards approved by the Linguistics and English Language Ethics committee at the University of Edinburgh.
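The recruitment figures reported here can be checked with a few lines of arithmetic. The sketch below (Python, using only the numbers stated in this section) recovers the totals and the per-condition exclusion rates:

```python
# Recruited participants per condition, as reported in the text, and the
# 300 usable recordings each condition required (20 chains x 15 generations).
recruited = {"CONTROL": 554, "BIG": 592, "SMALL": 591,
             "ROUND": 565, "POINTY": 552}
USABLE_PER_CONDITION = 20 * 15  # 300

total_recruited = sum(recruited.values())             # 2,854
total_usable = USABLE_PER_CONDITION * len(recruited)  # 1,500
total_excluded = total_recruited - total_usable       # 1,354

# Share of recruited participants excluded in each condition.
exclusion_rate = {cond: (n - USABLE_PER_CONDITION) / n
                  for cond, n in recruited.items()}
```

Roughly 47% of recruits were excluded overall, with per-condition rates between about 46% (CONTROL) and 49% (BIG and SMALL).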

Stimuli
Of the five conditions, four were designed to prime the participants with a meaning by including either of the semantically oppositional poles of the SIZE-domain (BIG-SMALL) or the SHAPE-domain (ROUND-POINTY). The meanings for the BIG- and SMALL-conditions were conveyed in text, since stimuli based on illustrations would require comparison in order to convey the correct meaning. The participants were presented either with the sentence 'The word you are going to hear means big' in the BIG-condition or with 'The word you are going to hear means small' in the SMALL-condition. The biases for the ROUND- and POINTY-conditions were conveyed through shapes presented visually together with the sentence 'The word you are going to hear means', as shown in Fig. 2. In the CONTROL-condition, participants were not primed with a meaning.
All transmission chains were initialized with the same single seed word (i.e. the same audio stimulus was presented to the first participant in every transmission chain). This was to make it as easy as possible to track the development of sounds and groups of sounds over generations and to allow easier comparison across conditions. To allow a variety of different potential iconic strategies to emerge, we designed the seed word to include a typologically, acoustically, and articulatorily varied selection of segments.
It is difficult to ensure that each speech sound that a word contains will not result in any kind of semantic association for all speakers. This is not because all segments are iconic, but rather because of lexical transfer as a result of segments' varying occurrence in words across languages. In addition, there are also associations which could stem from the idiosyncratic salience that certain segments might have in an individual speaker's mental lexicon. We have neither a comprehensive overview of all iconic mappings between sounds and meanings utilized throughout human languages, nor a list of potential language-specific or individual sound-meaning patterns. Thus, we need a seed word that is located at the center of cross-linguistic phonological space to allow for sound changes in any direction across generations of participants, especially in regard to speakers' different perception of speech sounds due to language-specific phonological systems. Furthermore, this word has to accommodate a reasonable mutation rate (i.e. to ensure that the seed word can evolve phonetically, it should be somewhat difficult to remember). If the word were too easily learned, the participants would be able to repeat it perfectly and there would hence be no space for evolution to operate in.
The seed word was designed to consist of three syllables. The sounds were selected to be present in more than half of the phonologies of the world's languages (Mielke 2017; Moran and McCloy 2019), the exception being [r], which is the most common vibrant but was found in only 44% of the phonologies. Since the initial seed word was assumed to adapt to the participants' phonologies quickly, using uncommon sounds to increase the mutation rate was deemed unnecessary. Long versions of the three most extreme vowels, [iː], [aː], and [uː], were included, and the seven featured consonants were selected to be evenly distributed across manners and positions of articulation, as shown in Table 1.
Approximately the same number of voiceless and voiced consonants was used in the word, and consonant clusters were designed to include both voiced and voiceless sounds. In addition, the voiceless consonants were placed in the same syllables as the vowels with lower F 2 , [u] and [a], and the voiced grave (Jakobson et al. 1953) consonants in the same syllable as the vowel with the highest F 2 , [i], to distribute the general spectral energy throughout the entire word. The selected parameters resulted in the word form [giːmpraːlhuːs], which was then recorded by a female native speaker of Czech with an academic background in linguistics to ensure a phonetically accurate pronunciation of the word. The selected segments of the seed word are present in, on average, 76% of the 2,155 phonologies available in the cross-linguistic phonological inventory repository PHOIBLE (Moran and McCloy 2019).

Figure 1. Illustration of the experimental procedure for the five conditions. The first-generation participants (G1) are exposed to their condition-specific visual stimuli and then to the seed word. They then repeat the word, and their production is, in turn, used as the audio stimulus for the subsequent generation in the same transmission chain. This process is iterated until all chains have successfully transmitted the evolving string of sounds through 15 participants.
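The coverage criterion used for the seed word, the average share of phonological inventories that contain each of its segments, has the shape of a simple mean. The Python sketch below illustrates that computation; the per-segment presence rates are invented placeholders, not values extracted from PHOIBLE:

```python
# Hypothetical per-segment presence rates (share of inventories containing
# each segment). These figures are illustrative placeholders only; the
# real values would be computed from the PHOIBLE data itself.
presence = {
    "g": 0.57, "i": 0.92, "m": 0.96, "p": 0.86, "r": 0.44,
    "a": 0.86, "l": 0.68, "h": 0.56, "u": 0.88, "s": 0.67,
}

def mean_coverage(segments, presence):
    # Average share of inventories that contain each segment of the word.
    return sum(presence[s] for s in segments) / len(segments)

seed_segments = list("gimpralhus")  # length distinctions ignored here
coverage = mean_coverage(seed_segments, presence)
```

With these placeholder rates the mean comes out at 0.74; the 76% figure reported above would be obtained by running the same computation over the actual PHOIBLE inventories.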

Procedure
The task began with the following general instructions: 'In this task you will hear a word in an "alien" language. We will also tell you the meaning of the word. Your task is to listen carefully to the word and repeat it into your microphone. Make sure your speakers or headphones are switched on and the volume is turned up. First, we will tell you the meaning of the word. Then you will hear the word. There will then be a 3-second pause. Finally, you must repeat the word into the microphone.' Participants in the CONTROL-condition, however, were not told that they would be presented with the meaning of the word.
Next, the participants entering the ROUND- and POINTY-conditions were presented with the round and pointy shapes. Those entering the BIG- and SMALL-conditions were presented with the text stimuli and were then required to confirm that they had read the text by typing 'big' or 'small', depending on condition, in order to continue with the task. This step was included to make sure that the participants actively read the text stimuli, since these could be overlooked more easily than the shape stimuli. It was skipped for the participants in the CONTROL-condition, who instead proceeded directly to the listening and production steps. The first participant in each transmission chain listened once to the constructed seed word, followed by a 3-second pause, after which they had to repeat what they heard into their microphone. After completing the task, the participants were asked what they thought the word meant, along with a few background questions (native and other languages). The utterance that the participant recorded was then uploaded to our server. All recorded stimuli were manually checked by the experimenter; it was often necessary to normalize the volume to a consistent level and/or trim the recording to include only the actual utterance. The recorded utterance was then used as the stimulus for the next participant in the same transmission chain.

Data analysis
After data collection was completed (audio files can be accessed at https://osf.io/y3eru/, along with other supplementary material (see online), Appendix 4 and Appendix 5), the audio recordings were transcribed into the International Phonetic Alphabet (Appendix 1). This was done by the first author, who was blind to the conditions the data belonged to. Tone, stress, and phonemic length were not taken into consideration in the analysis, as these features are seldom transmitted correctly when speakers of different languages attempt to pronounce utterances containing them. Diphthongs, triphthongs, affricates, and coarticulations were divided into their components and analyzed as separate segments for comparability.
The transcribed sounds were then categorized according to six binary sound parameters, three for vowels and three for consonants (Appendix 5). Vowels were divided into HIGH and LOW, FRONT and BACK, and ROUNDED and UNROUNDED, while consonants were divided into GRAVE and ACUTE, VOICED and VOICELESS, and SONORANT and OBSTRUENT (see Table 2). The HIGH-group included high, near-high, high-mid, and true-mid vowels (including [ə]), while the remaining vowels were assigned to the LOW-group. The FRONT-group included front and near-front vowels, and the BACK-group included central (including [ə]), near-back, and back vowels. The ROUNDED-group included all rounded vowels and the UNROUNDED-group all unrounded vowels. Likewise, the VOICELESS-group included all voiceless consonants, the VOICED-group all voiced consonants, the SONORANT-group all sonorant consonants, and the OBSTRUENT-group all obstruent consonants. Finally, the GRAVE-group included bilabials through linguolabials as well as velars through glottals, and the ACUTE-group included dentals through palatals.
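The consonantal side of this categorization amounts to a small lookup. A Python sketch follows; the paper's analysis pipeline is not given as code, so the structure here is illustrative, with consonant memberships following Table 2:

```python
# Sketch of the three binary consonant parameters (Table 2).
# Each consonant belongs to exactly one category per parameter.
CONSONANT_CLASS = {
    "position": {"GRAVE": set("mpkbgfhvw") | {"ŋ"},
                 "ACUTE": set("ntdszlr")},
    "manner":   {"SONORANT": set("mnwlrj") | {"ŋ"},
                 "OBSTRUENT": set("ptkbdgfshvz")},
    "voicing":  {"VOICED": set("mnbdgvzwlrj") | {"ŋ"},
                 "VOICELESS": set("ptkfsh")},
}

def classify_consonant(seg):
    """Return the category of `seg` on each of the three consonant parameters."""
    return {param: next(cat for cat, members in groups.items() if seg in members)
            for param, groups in CONSONANT_CLASS.items()}

print(classify_consonant("s"))
# e.g. [s] is ACUTE, OBSTRUENT, and VOICELESS
```

The vowel parameters (HIGH-LOW, FRONT-BACK, ROUNDED-UNROUNDED) would be encoded the same way, with membership sets built from the height, backness, and rounding groupings described above.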

Statistical model
We modeled the proportion of vowels or consonants of each particular sound parameter (HIGH-LOW, FRONT-BACK, ROUNDED-UNROUNDED, GRAVE-ACUTE, VOICED-VOICELESS, SONORANT-OBSTRUENT) out of the total number of vowels or consonants in the word for generation 0 (seed word) through 15. Proportions rather than absolute values were chosen in order to compensate for reduplication and word-length effects. The proportions were calculated separately for vowels and consonants, since some transmission chains might utilize the former iconically while others utilize the latter. For example, if an association is found between a meaning and high-frequency sounds, the sounds in question could be voiceless consonants, front unrounded vowels, or both. Thus, a phonetic form such as [tuta] was analyzed as 100% [t] in terms of its consonants, and 50% [a] and 50% [u] in terms of its vowels. We then used binomial mixed models with generation and condition as predictors, with an interaction; one such model was fit for each of the six sound parameters. To account for the non-independent nature of observations from the same chain, we included chain as a random intercept, which may also mitigate autocorrelation of residuals between adjacent observations. To minimize the risk of overfitting, we imposed conservative shrinkage toward zero with the horseshoe prior (Carvalho et al. 2009). The models were fit using R 3.6.0 (R Core Team 2020) and the package brms version 2.9.0 (Bürkner 2017). We first modeled the changes in proportion of each sound parameter and condition, including the CONTROL-condition. We then also compared the changes in proportions for each of the stimuli conditions to those of the CONTROL-condition.
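The per-word proportion calculation can be sketched as follows; this is an illustrative Python reconstruction, not the authors' code, and the simplified vowel inventory is an assumption:

```python
# Sketch of the per-word proportion calculation described above.
# A form like [tuta] counts as 100% acute consonants, and 50% [u] / 50% [a]
# among its vowels. The vowel inventory here is a simplified stand-in.
VOWELS = set("aeiouɛɔəæ")

def parameter_proportion(word, members):
    """Proportion of a word's consonants (or vowels, if `members` are vowels)
    that belong to `members`, out of all consonants (or vowels) in the word."""
    want_vowels = next(iter(members)) in VOWELS
    pool = [s for s in word if (s in VOWELS) == want_vowels]
    hits = [s for s in pool if s in members]
    return len(hits) / len(pool) if pool else float("nan")

ACUTE = set("ntdszlr")
print(parameter_proportion("tuta", ACUTE))  # acute share of the consonants
print(parameter_proportion("tuta", {"u"}))  # [u] share of the vowels
```

A rough brms analogue of the model described above might be `successes | trials(total) ~ generation * condition + (1 | chain)` with a horseshoe prior on the population-level coefficients; this formula is a reconstruction from the prose, not the authors' exact code.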

General results
In total, the participants reported 58 different first languages, which can be found in Appendix 4. Two thirds of all participants reported one of the five most common languages: Spanish (485), English (223), Serbo-Croatian (104), Russian (99), and Arabic (88). On average, the original 10 segments (3 vowels, 7 consonants) of the seed word were reduced by approximately 3 at generation 15, as seen in Fig. 3. The reduction of total word length was mainly caused by the loss of consonants, which at generation 15 had been reduced from the original 7 to approximately 4. The vowels, on the other hand, were only reduced by about a quarter of a segment on average. These differences between consonants and vowels could plausibly be attributed to a general articulatory preference for simple syllable structures; see, for example, the transmission chains in Table 3.
First, we tested whether we could find any noteworthy over- or underrepresentations of the sound groups when comparing the seed word to the generation-15 words within each condition. Since all sound groups were constructed in pairs, an overrepresentation of, for example, rounded vowels would correspondingly also result in an underrepresentation of unrounded vowels. All conditions except BIG showed noteworthy changes for at least two of the investigated sound parameters (see Fig. 4 and Appendix 2). However, the HIGH-LOW and SONORANT-OBSTRUENT parameters did not produce any noteworthy changes. The proportion of FRONT vowels increased most clearly in the SMALL-condition, and the proportion of VOICED consonants increased slightly (full estimates are given in Appendix 2).

Table 2. Consonant groupings. Voicing: VOICED m, n, ŋ, b, d, g, v, z, w, l, r, j; VOICELESS p, t, k, f, s, h. Manner: SONORANT m, n, ŋ, w, l, r, j; OBSTRUENT p, t, k, b, d, g, f, s, h, v, z. Position: GRAVE m, ŋ, p, k, b, g, f, h, v, w; ACUTE n, t, d, s, z, l, r.
Secondly, we compared the sound distributions for each stimuli condition to those of the CONTROL-condition (see Fig. 5 and Appendix 3), which sharpened the results and made them easier to interpret. There were two cases with clear effects: the SMALL-condition showed an increase in FRONT UNROUNDED vowels, and the POINTY-condition an increase in ACUTE consonants. In addition, a weaker yet noteworthy effect was found for the proportion of GRAVE consonants, which decreased in the POINTY-condition by −8.7% [−16.6, −0.5] when compared to the CONTROL-condition. Furthermore, as shown in Fig. 6, the noteworthy changes compared to the CONTROL-condition started taking off around generation 6 and gradually increased, which can be seen most clearly in the ROUNDED-UNROUNDED parameter. This could be attributed to word length and syllabic complexity, which might have distracted the participants from the text and shape stimuli. The average decrease of word length (four sounds) was most prominent in the early generations; by generation 6 the word lengths had decreased by three sounds, and by generation 15 they had only decreased by one additional sound. Thus, the stimuli words would have had to become simplified before iconicity could start affecting the sounds in the words. This suggests that even stronger effects might be observed over longer transmission chains (cf. Tamariz et al. 2017).

Discussion
The purpose of this study was to investigate how iconic associations emerge and which sounds and meanings are involved by observing how a single seed word was altered by being transmitted between language users in five different conditions.

General discussion
The most important results were yielded by comparisons between the CONTROL-condition and the stimuli conditions, in which iconic effects were found for both vowels and consonants. This is in line with other studies that have shown that both vowels and consonants are involved in size and shape iconicity (Ahlner and Zlatev 2010; Nielsen and Rendall 2013; D'Onofrio 2014). The clearest results were produced by the SMALL-condition, which showed a preference for FRONT and UNROUNDED vowels and a dispreference for BACK and ROUNDED vowels. The preferred sounds, typically represented by [i], fit Ohala's (1994) frequency code, which predicts that smallness, as well as related meanings, is evoked by high and/or rising frequencies of vocalizations. Furthermore, a plethora of cross-linguistic and experimental studies have found similar associations between size and energy level or pitch (Sapir 1929; Newman 1933; Taylor and Taylor 1962; Fónagy 1963; Taylor 1963). For example, Erben Johansson et al. (2020) found SMALL and SHORT to be associated with voiceless consonants, which of course also involve high-frequency energy (Ohala 1994). Consequently, this association should probably be regarded as one of the most robust iconicity effects found, since it aligns with solid typological and experimental evidence.
The most surprising result was the decrease of GRAVE consonants, and the corresponding increase of ACUTE consonants in the POINTY-condition, since one of the most common GRAVE consonants, [k], is often featured in pointy stimuli words (e.g. [kiki]). The results do, however, align with the idea that consonants might play a somewhat larger role than vowels in shaping vocal iconicity (Nielsen and Rendall 2011;Fort et al. 2015). This does not necessarily mean that [k] is confirmed to be disfavored when paired with pointy shapes, since the sound group also contains labial and voiced consonants. Nevertheless, this has some implications for bouba-kiki tasks since it demonstrates that using ready-made stimuli words for experiments such as this might not always produce accurate effects (Dingemanse et al. 2015).
These findings also suggest a slightly more complex mapping between sound and meaning than pitch-to-size. ACUTE sounds do generally involve higher-frequency energy than GRAVE sounds, but the ACUTE group included both voiceless and voiced sounds, and voicing is the primary consonantal distinction between high- and low-frequency energy; yet no effect was found for the VOICED-VOICELESS parameter. It is thus possible that the sharpness conveyed by ACUTE sounds is more fundamental than the overall energy difference between VOICED and VOICELESS sounds (see also Aryani et al. 2020). Consequently, there might also be a discrepancy between associated phonetic parameters and semantic domains. As one pole of the continuous SIZE-domain, the SMALL-condition can be clearly linked to the equally continuous frequency scale, but the sounds mapped to the POINTY-condition could be based, at least partially, on some other, potentially dichotomous, type of mapping. The preference for these different types of associations could be grounded in the semantic features of the stimuli: BIG and SMALL are rather abstract and require comparison in order to be defined, which is a good fit for degrees of pitch, whereas ROUND and POINTY are considerably more visually concrete, and their contrasting geometrical features could also be used to tell them apart from shapes such as squares or ellipses. Accordingly, the sounds associated with POINTY could portray similar concreteness. Alternatively, the differences in presentation between the text stimuli in the SMALL- and BIG-conditions and the shape stimuli in the POINTY- and ROUND-conditions might have resulted in the conditions not being completely comparable due to different confounding factors. Nevertheless, it is unlikely that this would completely explain why vowels were found to be associated with SMALL and consonants with POINTY. It is also possible that there was a specific orthographic confound in the SMALL- and BIG-conditions (Cuskley et al. 2017). While no effect was found for the BIG-condition when compared to the CONTROL-condition, the word small in the SMALL-condition does include the letter ⟨a⟩, which represents an unrounded front vowel in many languages and could have led to an increase of similar sounds in the results. However, in the English pronunciation of small, ⟨a⟩ represents /ɔ/, a rounded back vowel, which suggests that this effect would be rather modest.
Another interesting finding was that, when compared to the CONTROL-condition, the SMALL- and POINTY-conditions produced iconic effects while the BIG- and ROUND-conditions did not. It is difficult to know exactly why only one of the poles of each of these semantic parameters produced effects, but similar results have been found in previous studies, although for both pointy and round shapes (Nielsen and Rendall 2011; Jones et al. 2014; Tamariz et al. 2017; Fort et al. 2018). There is also evidence that antonyms and semantically oppositional concepts are cognitively closely related (Deese 1965; Justeson and Katz 1991; Willners 2001; Paradis et al. 2009), that poles of the same semantic dimension differ in their iconic predictability (Westbury et al. 2018), and that iconic relationships between concepts can be upheld in reversed order (Johansson and Carling 2015). This could therefore indicate that these types of concepts are iconically coded pairwise.
Finally, it should be pointed out that the CONTROL-condition produced two noteworthy changes, and all stimuli conditions produced decreases in GRAVE consonants, which illustrates the difficulty of designing the seed word. However, the decrease in front vowels and the increase in rounded vowels in the CONTROL-condition are in fact mirror images of each other, since front vowels are usually unrounded. Furthermore, both of these effects are found in the experimental conditions as well, with the notable exception of the SMALL-condition. In addition, to minimize the risk of finding effects by chance, we controlled for multiple comparisons by imposing a conservative shrinkage prior (see Section 2.5). This suggests that these changes could be interpreted as a stabilization toward a kind of typological and/or articulatory default. One might further assume that the proportions of iconic sounds would increase indefinitely until the transmitted words consisted only of front unrounded vowels. This is, however, unlikely for a number of reasons, since linguistic material from various sources is dynamically introduced into words as languages change over time. First, words, except for a very small number per language, generally adhere to phonotactic restrictions that require them to include both vowels and consonants; there simply are not enough unique individual phonemes in languages to be assigned to all meanings that need to be conveyed. Secondly, many languages require all words, including loans, to have affixes attached to them in order to be grammatical. Similarly, the participants in the present study were instructed to repeat what they heard, which forced them to retain considerable parts of the syllable structure and sounds of the previous utterance.
What is required for iconicity to emerge?

Jones et al. (2014) showed that iconicity can emerge through transmission. However, as with most previous experiments investigating iconicity, the participants were highly restricted due to the use of text-based artificial languages or forced-choice experimental designs. While subject to the same methodological restrictions, Tamariz et al. (2017) found that iconicity emerges only through communicative interaction and not through individual reproduction. The stronger effect of interaction was attributed to an increased number of possible innovations that could increase iconicity, as well as a larger number of possible adopters of the signal, which increases the chance of labels fitting with meanings in a speech community. Therefore, Tamariz et al. argue that their results can be interpreted as evidence for random mutation and selection rather than guided variation, that is, the process by which cultural traits acquired through individual learning drive cultural evolution. The overall reasons why iconicity emerges over time are harder to tease apart and outside this article's scope. However, it can be assumed that, in the present study, either the participants' memories were affected, causing them to misremember a more iconic version of the stimulus, or their perception was affected, causing them to perceive a more iconic version of the stimulus. Alternatively, a combination of these and other factors could be involved, such as muffled recordings creating vaguer stimuli that would give rise to more change, and assimilation of the phonemes of the stimuli words into the native phonology of the participants. This could somewhat limit the generalizability of the present study.
Nonetheless, it would be unwise to underestimate the role of transmission in the dynamics of iconicity. First, both Jones et al. (2014) and the present study showed that transmission alone is enough for iconic effects to arise. Secondly, the present study further suggests that very little is required in order for iconicity to emerge (Edmiston et al. 2018). Even without interaction between participants, constrained experimental setups, forced choice questions, premade stimuli words or using text as a proxy for spoken language, all of which could in some manner increase the likelihood of mapping sound to meanings correctly outside of the bouba-kiki effect (Cuskley et al. 2017), iconic effects seem to have emerged. Thirdly, there is overwhelming evidence that iconic forms, including language-specific ideophones, facilitate language learning and comprehension in both children and adults (Imai et al. 2008;Nygaard et al. 2009;Kantartzis et al. 2011;Imai and Kita 2014;Lockwood et al. 2016a,b;Massaro and Perlman 2017).
However, while iconicity and synesthetic crossmodal mappings are present in the early stages of human ontogenetic development (Mondloch and Maurer 2004; Maurer et al. 2006; Walker et al. 2010) and go at least as far back as the ancestor we share with chimpanzees (Ludwig et al. 2011), a recent study failed to find a bouba-kiki effect in great apes (Margiotoudi et al. 2019). In addition, these mappings do not seem to disappear, but as language competence and vocabulary size increase with age, the share of iconicity in the lexicon is lower for adults than for children (Ludwig and Simner 2013; Perry et al. 2015; Massaro and Perlman 2017). The likely explanation is that iconicity does not scale well in language. In a less developed and lexically poor language, iconicity can aid in intuitively linking words to fundamental meanings, but as languages adapt to the expressive needs of their users, the number of distinctions that must be made cannot be handled by an iconic system. This is where iconicity falls short: there simply are not enough unique iconic signals (whether sounds or gestures) available to accommodate the diversity of meanings that language users might wish to express (Gasser 2004; Westbury et al. 2018).
Nevertheless, iconicity is still found in complex languages and seems to permeate many sections of the lexicon (Sidhu et al. 2019), although it excels in specific functions in conjunction with arbitrary and systematic mappings between sound and meaning (Monaghan et al. 2011;Dingemanse et al. 2015). However, agents without advanced language competence, such as great apes, do utilize iconicity even though they have very limited access to interactional language, which suggests that the transmission of signals is enough to facilitate iconicity. Furthermore, large-scale cross-linguistic studies on lexical iconicity show that iconic forms are present throughout languages and language families, but also that the same sound-meaning associations are not found everywhere at the same time (Wichmann et al. 2010;Blasi et al. 2016;Erben Johansson et al. 2020;Joo 2020), which suggests that iconicity is in a perpetual process of decay and rebuilding (Johansson and Carling 2015) and not conserved through time (Pagel et al. 2013). In sum, certain iconic associations between speech sounds and semantic features seem to affect the formation of lexemes in human language and while interaction could provide an even more advantageous environment for non-arbitrary associations, this study suggests that interaction might not be a prerequisite for iconicity.

Conclusion
We have shown that by adopting an iterated learning approach to the classic bouba-kiki and mil-mal experiments, and by including a much larger number of participants than previous studies, it was possible to see iconic effects emerge. By using a simple methodological setup with an auditorily modest linguistic environment, without premade stimuli words or task training, we were able to gain a deeper understanding of how vocal iconicity operates within the semantic SIZE- and SHAPE-domains. Not only did these results align with the sound-meaning associations found in large-scale cross-linguistic and experimental studies, but one of the effects also strengthened gradually with generation, which indicates that stronger effects might be observed with longer transmission chains. Furthermore, only the SMALL- and POINTY-conditions produced iconic effects, while the BIG- and ROUND-conditions did not, which could indicate that iconic effects need not be equal for concepts belonging to the same semantically oppositional pair. In addition, while the sounds associated with the SMALL-condition could be linked to differing degrees of frequency, the sounds associated with the POINTY-condition could indicate that another factor serves as a foundation for this mapping, which should be investigated further. In sum, these results indicate that linguistic transmission through disconnected language users is enough to investigate cognitive biases for vocal iconicity. The approach can easily be expanded to a range of iconically promising meanings, for example TALL, LONG, and MANY, and is of importance for our understanding of how iconicity emerges and decays, and how it can shape lexicons.

Supplementary data
Supplementary data is available at JOLEVO online.
Appendix 3. Model outputs for proportions of included sound parameters in the small, big-, pointy-, and round-conditions compared with the control-condition