Size sound symbolism in the English lexicon

Experimental and cross-linguistic evidence suggests that certain speech sounds are associated with size, especially high front vowels with ‘small’ and low back vowels with ‘large’. However, empirical evidence that speech sounds are statistically associated with magnitude across words within a language has been mixed and open to methodological critique. Here, we used a random-forest analysis of a near-exhaustive set of English size adjectives (e.g., tiny, gargantuan) to determine whether the English lexicon is characterized by size-symbolic patterns. We show that sound structure is highly predictive of semantic size in size adjectives, most strongly for the phonemes /ɪ/, /i/, /ɑ/, and /t/. In comparison, an analysis of a much larger set of more than 2,500 general vocabulary words rated for size finds no evidence for size sound symbolism, thereby suggesting that size sound symbolism is restricted to size adjectives. Our findings are the first demonstration that size sound symbolism is a statistical property of the English lexicon.

BODO WINTER

There are many different communicative phenomena that reflect iconicity. One phenomenon that has received a lot of attention, often called "size sound symbolism," involves the association of speech sounds with large/small size. For example, speech sounds with more high-frequency components, such as the vowel /i/, are associated with smallness (Tarte 1982; Knoeferle et al. 2017). This is thought to derive from the fact that smaller things (e.g., animals, objects) typically produce higher-frequency sounds (Ohala 1983). This makes the association of particular vowels with size concepts a form of iconicity, because the acoustic qualities of 'small' vowels such as /i/ and /ɪ/ resemble the sounds produced by small things in the world, and vice versa for 'large' vowels such as /a/, /ɑ/, /ɔ/, and /o/.
The association between size and speech sounds has been noted for a long time (Wedgwood 1845; Jespersen 1922), with Sapir (1929) being the first to demonstrate experimentally that English-speaking participants match pseudowords such as mil with small objects, and pseudowords such as mal with large objects. A diverse set of experiments has directly or conceptually replicated this basic pattern (Newman 1933; Birch & Erickson 1958; Greenberg & Jenkins 1966; Johnson 1967; Tarte & Barritt 1971; Tarte 1982; Berlin 2006; Parise & Spence 2009; Baxter & Lowrey 2011; P. D. Thompson & Estes 2011; Parise & Spence 2012; Baxter et al. 2015; Auracher 2017; Knoeferle et al. 2017). In addition, typological research has consistently found that high front vowels are associated with the concept 'small' in a large number of the world's languages (Ultan 1978; Fitch 1994; Haynie et al. 2014; Blasi et al. 2016; Johansson et al. 2019). Yet, these crosslinguistic generalizations are often based on a single word pair (e.g., small/large). Does size sound symbolism also characterize larger swaths of vocabulary within a given language? From the very first experimental studies with nonce words such as Sapir's mil/mal, researchers voiced skepticism about whether these studies on pseudowords had anything to do with the actual vocabularies of natural languages (Bentley & Varon 1933).
When speaking of size sound symbolism being a property of the English lexicon, we are interested in demonstrating its "systematicity", defined by Dingemanse et al. (2015: 604) as "a statistical relationship between the patterns of sound for a group of words and their usage." Such systematicity is, in theory, orthogonal to iconicity (Nielsen 2016; Nielsen & Dingemanse 2020): a systematic pattern in the lexicon can be motivated by iconicity or it can be non-iconic, such as when resulting from an accident of language history (cf. discussion in Cuskley & Kirby 2013). As an example of a non-iconic systematic pattern, consider the fact that in English, the voiced dental fricative /ð/ occurs word-initially only in function words (the, that, this, there) (Bloomfield 1933: 244). This association between sound and function is not obviously rooted in any form of iconicity, as it is not clear how the phoneme /ð/ could be taken to resemble the corresponding meanings. Other patterns are systematic and iconic, such as the fact that in Chaoyang, ideophones denoting abrupt sounds are more likely to end in stops, such as piak and pak 'sound of gunfire', tiak 'clacking sound of an abacus', and kok 'sound of hen clucking' (Thompson & Do 2019). The systematicity of this pattern is evidenced by the fact that this form-meaning pairing recurs across several different Chaoyang ideophones. The iconicity of these ideophones is presumably grounded in the fact that words ending in stops involve a more abrupt articulatory closure as well as a more abrupt acoustic offset (Rhodes 1994). Evidence that this is a case of iconicity rather than mere systematicity comes from the fact that the same pattern (stop-final words for abrupt offsets) has been reported numerous times for a diverse set of languages (Stoddart 1858; Sommer 1933; Wissemann 1954; Hamano 1998 in a universally accessible sense of resemblance between sound and meaning that is used by speakers of different languages.
In the case of English, researchers have been keen to point out that there are salient counterexamples to size sound symbolism, most notably the adjectives big and small, which have a high and a low vowel, respectively (Jespersen 1922: 406; Wescott 1971: 421). Empirical analyses of the English lexicon have produced mixed results. Using a thesaurus to assemble a list of words for small and large concepts, Newman (1933) failed to find an association between vowels and size. But as noted by Brown (1958: 119), his list included "many words whose association with size is remote, e.g., decimate, descend, wretched, stalwart." In another analysis of 181 highly frequent English monosyllables, Thorndike (1945: 10) found that /i/ and /ɪ/ are associated with smallness, and that /ɔ/ and /oʊ/ are associated with largeness. However, Thorndike's study was also problematic in that it relied on his own subjective introspection to judge a word's semantic size.
In later work, Johnson (1967) asked American English speakers to generate as many words as they could that suggest smallness or largeness, yielding 324 unique word types. This set exhibited evidence for size sound symbolism, with /i/ being much more frequent in words for small as opposed to large concepts, in contrast to /a/, /u/, and /o/, which showed the reverse pattern. Yet, this procedure of generating words is problematic because phonological and semantic priming effects have the potential to inflate sound symbolic properties. In addition, Johnson's (1967) word list is unpublished, leaving the possibility that it contains many etymologically related forms, which would artificially inflate the sample size by adding non-independent cases to a statistical analysis. Finally, Katz (1986) investigated a set of 60 words with size ratings from Paivio (1975), failing to replicate the association between vowels and size previously established by Thorndike (1945) and Johnson (1967). However, Katz's word list included 'small' nouns such as butterfly, toe, and grape, and his 'large' words included nouns such as tractor, iceberg, and elephant. While these words arguably denote small and large objects respectively, they are not words that relate specifically to size.
In summary, despite decades of interest in the phenomenon of size sound symbolism, there is no conclusive evidence that speech sounds are associated with semantic size across distinct word types in the English lexicon. Besides controversy about whether the word lists were suitable (Brown 1958: 119; Taylor 1963: 201), a persistent methodological problem of this line of research is the sole focus on vowels, even though voicing (Taylor 1963 Taylor (1963) already suggested that Newman (1933) may have missed that his list of words for large concepts features an overabundance of the velar stops /g/ and /k/ (gargantuan, glaring, great, gross, colossus, cargo, comprehensive, corporation, corpulence). Thus, what is needed is a statistical approach that takes all potentially relevant phonemes into account. Here, we use a machine-learning algorithm, random forests (Breiman 2001), that is able to incorporate a large number of predictor variables, allowing us to investigate the simultaneous impact of all English phonemes.

Methods
All data and code are available under the following OSF repository: https://osf.io/9q4nc/.

Size adjective list
Our goal was to compile a word list that was unbiased, as exhaustive as possible, not generated by participants, more homogeneous than previous lists, and included only words directly about size without mixing different parts of speech. To achieve this, we used https://www.thesaurus.com/ to extract 223 words which were listed as synonyms of tiny, small, large, big, and huge.[1] In our analysis, we focus on words that have only one stem (including those with reduplicated stems), which excludes pint-sized, pocket-size, pocket-sized, yea big, barn door, small-scale, a whale of a, super colossal, oversize, undersize, and sizable. The word pocket (from pocket-sized) was excluded because its primary sense is nominal. We further excluded words whose first synonym in the thesaurus was not an adjective and/or not obviously size-related. This led to the exclusion of the following 13 forms: populous, prodigious, spacious, copious, ample, substantial, monster, magnificent, generous, immeasurable, insignificant, trifling, minimum.

[1] A reviewer pointed out that thesauri are still constructed by lexicographers, which opens up the possibility for bias. However, we are not aware of ways that this bias would favor sound symbolic patterns. Moreover, to our knowledge, our word list is a nearly exhaustive list of size terms, and so other lists would, presumably, include a very similar set of words.

Winter and Perlman Glossa: a journal of general linguistics DOI: 10.5334/gjgl.1646
We stripped all derivational morphology away from the word's root as we do not want grammatical information (e.g., the -ing suffix) to be counted. This also involves excluding the diminutive suffix -y, as in baby (the diminutive of babe), bitsy, bitty, itsy-bitsy, itty-bitty, puny, runty, teensy, teensy-weensy, teeny, and tiny (which comes from tine + -y). Although -y may be a grammaticized version of size sound symbolism (Waugh 1994; Shih & Rudin 2020: 3), including this morpheme would unduly bias our results towards /i/ being associated with small size. Our stemming procedure also excluded Latin and Greek morphology that may not be productive in English anymore (e.g., -al in colossal, or -ic in titanic). Thus, the word humongous is represented as humong- in our data, and the word colossal as coloss-.

Processing Glasgow norms
We additionally analyzed a set of 5,553 words rated for size by 829 native English speakers on a 7-point scale (Scott et al. 2019). These 'Glasgow norms' were collected on word senses, not word forms (e.g., arm 'limb' versus arm 'weapon'). We averaged ratings across word senses, yielding a reduced set of 4,683 words. From this, we extracted all monomorphemic words (based on the English Lexicon Project, Balota et al. 2007) that were either nouns, verbs, or adjectives (based on SUBTLEX, Brysbaert et al. 2012), yielding a total of 2,667 words (330 adjectives, 472 verbs, 1,865 nouns).

Statistical analysis
All analyses were conducted with R 4.0.2 (R Core Team 2019) and the 'tidyverse' package 1.3.0 (Wickham et al. 2019). We used 'ranger' 0.12.1 (Wright & Ziegler 2017) for random forests, and 'effsize' 0.8.0 (Torchiano 2019) and 'lsr' 0.5 (Navarro 2015) for effect sizes.

[2] Humongous is suspected to combine huge and tremendous; giant and gigantic are related; behemontic and mammoth are suspected to have influenced each other; bitty, bitsy, itsy-bitsy, and itty-bitty are all related to each other, and so are teeny, tiny, teensy, and teensy-weensy; as well as wee and pee-wee, and miniscular and miniscule. Finally, the word mini is a truncation of miniature, and the min-form has subsequently been extended to other forms, such as minikin (Bolinger 1949: 60

The core of our analysis uses random forests, a machine-learning algorithm that builds an ensemble of decision trees, in our case classification trees because our main response variable is categorical ('large' versus 'small' adjectives). Decision trees work by recursively splitting the data based on a split criterion, such as trying to minimize the variance at each node of the tree. For example, a decision tree could partition the data into those words that have a /t/ as opposed to those that do not have a /t/ based on the fact that many words with this sound are small adjectives, and most words without it are large adjectives (as demonstrated below). Then, within the /t/-featuring words, another branch of the tree could be added by further noticing that those words with /t/ and /i/ are even more likely to be small adjectives, and so on.
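To make the idea of a split criterion concrete, the following Python sketch scores a single candidate split on the presence of /t/. It uses Gini impurity purely as an illustrative criterion; this is an assumption for the sake of the example and not the split rule used in the analysis. The class counts are also assumptions, reconstructed from the percentages reported in the results for /t/ under the assumption of 34 large and 18 small adjectives.

```python
def gini(counts):
    """Gini impurity of a node, given the number of items in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(left, right):
    """Reduction in Gini impurity when a parent node is split into two children."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Assumed (large, small) counts, reconstructed from the reported percentages.
with_t = (13, 11)     # adjectives whose stem contains /t/
without_t = (21, 7)   # adjectives whose stem does not contain /t/

decrease = impurity_decrease(with_t, without_t)
print(f"Impurity decrease for the /t/ split: {decrease:.4f}")
```

A tree-growing procedure would compute such a score for every candidate predictor at a node and pick the best one; a positive decrease means the two child nodes are, on average, purer than the parent.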
Random forests grow such trees using only a random subset of predictor variables that is different for each tree, e.g., one tree may be grown with the predictors /t/, /b/, /o/, another tree with /ð/, /e/, /m/, and so on. The intuition behind this approach is that certain predictors may mask the effects of other predictors, and only by accumulating insight from a number of trees with different combinations of predictor variables do we get a more stable estimate of how much each variable contributes to the overall prediction. In addition to only considering a random subset of variables, each tree is grown on a random bootstrap sample of the data. The performance of each tree is evaluated on the data that the tree has not witnessed, the so-called "out-of-bag sample." Here, we focus on the out-of-bag error rate (OOB error), an estimate of how well the random forest generalizes to unseen data. To the extent that the OOB error is low, it is possible to predict whether size adjectives are large or small from phonological structure alone. That is, a low OOB error indicates systematicity in the lexicon, as it indicates that phonology predicts semantics.
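The bootstrap and out-of-bag logic described above can be illustrated with a minimal Python sketch. It only simulates which items each tree would and would not see; it does not grow any trees. The function name `bootstrap_split` and the fixed seed are hypothetical choices for illustration, and sampling with replacement is assumed.

```python
import random


def bootstrap_split(n_items, rng):
    """Draw a bootstrap sample (with replacement); return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n_items) for _ in range(n_items)]
    out_of_bag = sorted(set(range(n_items)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(1)  # fixed seed so that the sketch is reproducible
n_words = 52            # number of size adjectives in the text
n_trees = 1000          # number of trees reported in the text

oob_fractions = []
for _ in range(n_trees):
    in_bag, oob = bootstrap_split(n_words, rng)
    oob_fractions.append(len(oob) / n_words)

mean_oob_fraction = sum(oob_fractions) / n_trees
print(f"Mean share of words left out of a tree's sample: {mean_oob_fraction:.3f}")
```

On average, a bit more than a third of the words are left out of any given bootstrap sample, so every word ends up being predicted by a sizeable subset of trees that never saw it. The OOB error is the error rate of those held-out predictions.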
We used random forests because they have been argued to be appropriate for "small n, high p" situations that involve many different features that are potentially collinear (Strobl et al. 2009). Collinearity can be expected when doing analyses of the phonological structure of words since, due to phonotactic constraints, particular phonemes tend to occur together. Moreover, we have a large set of predictors (36 phonemes) to predict large/small size semantics, as well as, in the case of the size adjectives, a very small dataset (52 adjectives). Each predictor codes for the presence/absence of a phoneme in the stem. We used this binary presence/absence measure rather than the raw count of phonemes because most phonemes only occur once per word.
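The presence/absence coding can be sketched in a few lines of Python. The phoneme inventory below is a small hypothetical subset (the analysis uses 36 phonemes), and the transcriptions are hypothetical as well, chosen only to show how the coding behaves.

```python
# Hypothetical subset of the phoneme inventory, for illustration only.
PHONEMES = ["i", "ɪ", "ɑ", "t", "m", "n", "l", "g"]


def encode_stem(stem_phonemes, inventory=PHONEMES):
    """Return a 0/1 vector marking which inventory phonemes occur in the stem."""
    present = set(stem_phonemes)
    return [1 if p in present else 0 for p in inventory]


# Hypothetical transcriptions of two stems.
stems = {
    "mini": ["m", "ɪ", "n", "i"],
    "titan": ["t", "aɪ", "t", "ə", "n"],  # /t/ occurs twice but is coded once
}

features = {word: encode_stem(phonemes) for word, phonemes in stems.items()}
for word, vector in features.items():
    print(word, vector)
```

Because the coding records presence rather than counts, the two occurrences of /t/ in the second stem yield the same value as a single occurrence would. Phonemes outside the chosen inventory are simply ignored in this sketch.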
Random forests have various parameters that need to be tuned for each dataset (Probst et al. 2019), such as the number of random variables considered for each tree. Here, we used the 'tuneRanger' package version 0.5 (Probst et al. 2019) for hyperparameter tuning, which uses sequential model-based optimization (Hutter et al. 2011) for finding the best parameter values. We used tuneRanger to tune hyperparameters individually for each dataset (see OSF repository for exact specifications: https://osf.io/9q4nc/). We ran all random forests with 1,000 trees. The size adjective analysis features class imbalance, with nearly twice as many large (34) as small (18) adjectives. Because ranger's default "gini" split rule is known to be biased in the presence of class imbalance (Strobl et al. 2007), we used the "extratrees" split rule instead (Geurts et al. 2006).

Size adjectives
For the 52 etymologically unrelated adjectives, the random forest was able to predict the large/small distinction with very high classification accuracy (training accuracy: 98.1%) and, more importantly, relatively low out-of-bag prediction error (OOB = 22.30%), indicating that this finding would generalize well to unseen data. The accuracy of this random forest is much better than what would be expected if we naïvely assigned the majority category (in this case, large adjectives) regardless of which phonemes a word contains, in which case we would be accurate only 65.38% of the time. This shows that for size adjectives, sound structure is highly predictive of semantic size. Table 1 shows the probabilities of individual words belonging to each category. Only one word, runty, was misclassified. Based on its phonological properties, the random forest assigned it to the 'large' rather than the 'small' set. Additionally, the random forest was undecided for the word small, which was neither assigned to be part of the 'small', nor of the 'large' set. A look at variable importances (Figure 1a) suggests that four phonemes are especially important: /i/, /ɪ/, /t/, and /ɑ/, in order of predictive performance. The vowel /i/ occurred in only 3% of large adjectives as opposed to 28% of small adjectives (Figure 1b). Cramer's V for contingency tables (df = 1) suggests that this is a medium effect size: V = 0.31. The vowel /ɪ/ occurred in 18% of large adjectives as opposed to 39% of small ones (small-to-medium effect size, V = 0.19). The vowel /ɑ/ occurred in 26% of the large adjectives and none of the small adjectives (V = 0.28). The consonant /t/ occurred in 38% of large adjectives but 61% of small adjectives (V = 0.18). It is noteworthy that three out of the four most important phonemes were vowels, which appears to be in line with the fact that most studies focusing on size sound symbolism have focused on vowels (Sapir 1929; Katz 1986; Haynie et al. 2014).
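As a worked check on the figures in this paragraph, the Python sketch below recomputes the majority-class baseline and Cramér's V. All counts are assumptions reconstructed from the reported percentages: 34 large and 18 small adjectives (implied by the 65.38% baseline out of 52 words), and per-phoneme counts chosen to match the stated proportions. The use of a continuity-corrected (Yates) chi-square is likewise an assumption; under it, the reconstructed counts give values that round to the reported effect sizes.

```python
import math

N_LARGE, N_SMALL = 34, 18  # assumption: class sizes implied by the 65.38% baseline

# Assumed number of adjectives containing each phoneme: (large, small).
counts = {"i": (1, 5), "ɪ": (6, 7), "ɑ": (9, 0), "t": (13, 11)}


def cramers_v_2x2(a, b, c, d):
    """Cramér's V for the 2x2 table [[a, b], [c, d]] with a Yates-corrected chi-square.

    Assumes that no row or column total is zero.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    return math.sqrt(chi2 / n)  # min(rows, cols) - 1 equals 1 for a 2x2 table


majority_baseline = max(N_LARGE, N_SMALL) / (N_LARGE + N_SMALL)

effect_sizes = {
    phoneme: cramers_v_2x2(large, N_LARGE - large, small, N_SMALL - small)
    for phoneme, (large, small) in counts.items()
}

print(f"Majority baseline: {majority_baseline:.2%}")
for phoneme, v in effect_sizes.items():
    print(f"/{phoneme}/: V = {v:.2f}")
```

The sketch is only a consistency check on reconstructed numbers. The exact counts and the effect-size code are in the OSF repository cited in the Methods section.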
To assess the extent to which predictive accuracy depends on vowels and consonants, we ran separate random-forest analyses with consonant predictors and vowel predictors only. Both analyses revealed a stark drop in predictive accuracy (OOB error for vowels only: 34.54%; for consonants only: 34.62%), which suggests that both vowels and consonants are needed to attain high predictive accuracy.

Glasgow norms
Next, we assessed to what extent the patterns observed for size adjectives extend to other types of words. The same 36 binary predictors were used to predict the Glasgow size ratings, which were discretized for comparability with the size adjective analysis (median split). The overall accuracy was 72.21%, much lower than what was observed for the size adjectives. Crucially, the out-of-bag prediction error was nearly double that of the size adjectives (OOB = 42.11%). This is not a result of the dichotomization of the continuous size ratings, as suggested by the fact that the effect sizes for the most predictive phonemes are also negligible in the continuous case (Cohen's d = 0.19, 0.22, 0.31, for the three most predictive phonemes). Even if we take only the 10th percentile of most extreme large/small words, the OOB error is still nearly twice as large (41.46%) as in the size adjective analysis above; the same applies to the 20th percentile of most extreme words (OOB = 37.65%). A look at the variable importances also reveals that the pattern of importances does not make sense with respect to the existing literature on size sound symbolism, with neither /i/ nor /ɪ/ being indicated to have any predictive power. Instead, for example, the most predictive phoneme for the median-split random forest turned out to be /r/, which was slightly more frequent in the 'large' words (31.90%) than in the 'small' words (24.20%). However, the effect size of even this most predictive phoneme was minimal (Cramer's V = 0.08), and much smaller than the effect sizes we observed for size adjectives.
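The median split used to discretize the ratings can be sketched as follows in Python. The ratings are hypothetical values on a 7-point scale, not taken from the Glasgow norms, and assigning words exactly at the median to the 'small' class is an assumption of this sketch.

```python
from statistics import median


def median_split(ratings):
    """Label a word 'large' if its rating exceeds the median, otherwise 'small'."""
    cutoff = median(ratings.values())
    return {
        word: ("large" if rating > cutoff else "small")
        for word, rating in ratings.items()
    }


# Hypothetical mean size ratings on a 7-point scale.
ratings = {
    "ant": 1.2, "cup": 2.4, "dog": 3.5,
    "car": 4.8, "whale": 6.7, "mountain": 6.9,
}

labels = median_split(ratings)
for word, label in labels.items():
    print(word, label)
```

A median split yields two classes of roughly equal size, so a majority-class baseline for such data sits near 50%.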
Given the fact that lexical categories differ in how much they are prone to iconicity (Perry et al. 2015; Winter et al. 2017), we also ran separate random forest analyses for nouns, verbs, and adjectives. The predictive error was still twice as high for the adjectives from the Glasgow ratings (OOB = 42.73%) as for the set of size adjectives, and it was similarly high for nouns (OOB = 42.90%) and verbs (OOB = 45.55%). Table 2 summarizes the OOB error for all analyses performed on the Glasgow ratings, highlighting that across the board, the out-of-bag prediction error is much higher than in the case of the size adjectives.

Figure 1 (a) Relative variable importance (permutation-based); the dashed line represents the absolute value of the lowest variable importance, which can be used as a heuristic cut-off threshold for predictors that contribute strongly to the random forest's overall predictive performance (Strobl et al. 2009); (b) Proportion of size adjectives in the large/small class for the four most predictive phonemes.

Discussion
Our results show that size sound symbolism is a systematic property of English words, but it specifically resides in the lexical domain of size adjectives. Our results are broadly consistent with empirical evidence from iconicity rating studies which shows that adjectives are generally rated to be high in iconicity, especially when compared to nouns (Perry et al. 2015;). More generally, our results fit with the observation that adjectives are more focused on singling out specific semantic dimensions, especially in contrast to nouns (Lynott & Connell 2013;Winter 2019). Given that size is just one perceptual dimension among many, nouns are arguably too multidimensional to allow for strong generalizations on size sound symbolism. This may partially explain why Katz (1986) obtained a null result in his analysis: for many of the words he considered (e.g., 'small': grape, toe, butterfly; 'large': tractor, iceberg, elephant), size is only one semantic dimension among many others. We conclude that in investigations of size sound symbolism, it matters where one looks in the lexicon.
It also matters how one investigates size sound symbolism. Our study improved on past methodology by being agnostic to which phonemes should matter for size sound symbolism. Rather than performing separate hypothesis tests for individual phonemes, we used a bottom-up machine-learning algorithm that treats all phonemes equally. This method converges on the high front vowels /ɪ/ and /i/ predicting 'small', even though we excluded the diminutive suffix -y. In contrast, the low back vowel /ɑ/ is associated with 'large'. The only consonant that mattered was /t/, a voiceless plosive, which was associated with 'small'. Thus, altogether our results suggest that there are more English vowels that matter to size sound symbolism than consonants. However, the comparison between the random forest that considers all phonemes as opposed to the random forests that only used vowel predictors or consonant predictors shows that all phonemes are needed to achieve an accurate prediction. This may further explain why past research has failed to find size sound symbolism, since researchers, following the early experimental investigations of Sapir (1929), have generally focused their studies on vowels (e.g., Thorndike 1945; Johnson 1967; Katz 1986; Haynie et al. 2014).
It is interesting that specifically the voiceless alveolar stop /t/ turned out to be the most predictive consonant. Kawahara et al. (2018) observed that larger fictional creatures in the Japanese video game Pokémon tended to have names containing more voiced obstruents. Several other studies have found voicing to be associated with size in various languages (Klink 2000; Haryu & Zhao 2007; Shinohara & Kawahara 2010; P. D. Thompson & Estes 2011; Johansson et al. 2019). A number of different acoustic cues distinguish voiced from voiceless stops in English (Lisker 1986). Among other cues, voiced stops induce lower pitch in surrounding vowels (Kingston & Diehl 1994). In addition, voiceless stops have higher spectral components in the release of the stop than voiced consonants (Chodroff & Wilson 2014). Interestingly, these high spectral components are highest for the alveolar place of articulation (Chodroff & Wilson 2014), which could be a factor in making /t/ particularly suitable for the depiction of small size. All of this is suggestive of John Ohala's Frequency Code (Ohala 1983), according to which iconicity for size mimics the acoustics of small objects or animals, which tend to produce higher frequency sounds.
Our results also provide further support for the idea that sound symbolism is a probabilistic phenomenon. Given that phonemes primarily serve to distinguish meanings within the lexicon, it is futile to look for strict rule-like correspondences, such as every /i/ always being 'small'. Past discussions of size sound symbolism have over-emphasized individual counterexamples, thereby obscuring broader patterns in the lexicon. For example, linguists have repeatedly noted that small is an exception to size sound symbolism (Jespersen 1922; Wescott 1971), and indeed, our bottom-up machine learning algorithm agrees with this intuition. However, along with runty, small turns out to be one of just two exceptions across all English size adjectives. Thus, these results show that an over-emphasis on a few individual examples can distract from generalizations that can be made across cohorts of words.
It is useful to compare our findings to the literature on the cross-linguistic study of size sound symbolism, which has reported similar findings (Ultan 1978; Haynie et al. 2014; Blasi et al. 2016; Johansson et al. 2019). These studies generally focus on fewer words in order to cover more languages, thus trading within-language depth for cross-linguistic breadth. The findings of these studies show that the pattern that we have observed here across words in a language also exists across languages. These two facts taken together suggest that the systematicity established here is rooted in a genuine crossmodal correspondence between sound and size, rather than just being a statistical fluke of the English language. Thus, the association of high/low-frequency sounds with smallness/largeness is a universal statistical tendency, and it can be found both across languages and across distinct word types within languages, as demonstrated here for English.
Returning to Bentley and Varon's (1933) criticism that Sapir's (1929) experiment with nonce words like mil and mal has nothing to say about the English lexicon, our analyses clearly show that, in fact, the English lexicon does harbor size sound symbolism. This finding is in line with a growing body of work showing that the general vocabulary, not just 'specialized' words such as onomatopoeias or ideophones, features more iconicity than has traditionally been acknowledged (Haynie et al. 2014; Blasi et al. 2016; Joo 2020; Sidhu et al. 2021). Moreover, as we have shown here by correlating phonology and semantics, this iconicity does not only characterize a few words, but larger sets of words, such as our set of English size adjectives. This demonstrates that iconicity can play a role in shaping the vocabulary of spoken languages.