Introduction

Age of acquisition (AoA) refers to the age at which a word is acquired, and was first proposed to be a critical factor influencing word production even when word frequency was controlled (Carroll & White, 1973). Decades of research has consistently demonstrated theoretical and practical values of this variable beyond word production (cf. Juhasz, 2005). A quick search on the Web of Science in the more recent literature from 2006 to 2019 returned almost 800 articles that, either exclusively or partially, focused on the effects of AoA. As shown below, these studies ranged from word recognition to word production, from healthy participants to abnormal populations, from behavioral performance to computational modeling, and from language development to language evolution.

Research has documented the AoA effect with both comprehension and production tasks. In lexical processing, the AoA effect whereby early-learned words were processed more efficiently than late-learned words was found not only with visually presented words, but also with auditorily presented words (Räling, Holzgrefe-Lang, Schröder, & Wartenburger, 2015); not only for single words, but also for multiword phrases (Arnon, McCauley, & Christiansen, 2017); not only among adults, but also among children (Hsiao & Nation, 2018); not only related to L1 processing, but also applicable to L2 processing (Dirix & Duyck, 2017). In terms of word production, researchers reported that AoA was predictive of performance in oral, handwritten, and typewritten word production tasks (Laganaro, Valente, & Perret, 2012; Perret, Bonin, & Laganaro, 2014; Scaltritti, Arfé, Torrance, & Peressotti, 2016).

In addition to research for healthy participants, the effects of AoA have been investigated for patients with language deficits, for example, patients with aphasia (Brysbaert & Ellis, 2016; Khwaileh, Body, & Herbert, 2017; Räling, Schröder, & Wartenburger, 2016) and schizophrenia (Juhasz, Chambers, Shesler, Haber, & Kurtz, 2012; Smirnova, Clark, Jablensky, & Badcock, 2017). In these studies, AoA was examined to evaluate word retention, semantic access, and verbal fluency in these patients. Räling et al. (2016) further proposed the potential utility of AoA in diagnosis and treatment of refractory semantic access disorders (Forde & Humphreys, 1997; Crutch & Warrington, 2005).

Further, AoA has been investigated not only via a behavioral approach, but also through computational modeling and neural network simulation (e.g., Chang, Monaghan, & Welbourne, 2019; Sohrabi, 2019). These studies examined interactions among orthography, phonology, and semantic representations of words, the time course of the AoA effect, and the relatively greater resilience of early-acquired words to brain damage in order to pinpoint the locus and the mechanism of the AoA effect in word naming.

Finally, the AoA effect has been investigated in many different languages, including English (e.g., Brysbaert, 2017), Dutch (e.g., Menenti & Burani, 2007), Italian (e.g., Wilson, Ellis, & Burani, 2012), Farsi (e.g., Sohrabi, 2019), Persian (e.g., Bakhtiar, Su, Lee, & Weekes, 2016), and Chinese (e.g., Liu, Hao, Shu, Tan, & Weekes, 2008). AoA has also been studied beyond the scope of a single language. Monaghan and colleagues established a link between language acquisition and language evolution whereby AoA was associated with the probability of word borrowing from one language to another (Monaghan, 2014; Monaghan & Roberts, 2019).

AoA ratings

Given its theoretical and clinical implications as evidenced in the wide range of studies, ratings of AoA are often needed for research purposes. Currently, AoA ratings have been published for 30,000 English words (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012) and 30,000 Dutch words (Brysbaert, Stevens, de Deyne, Voorspoels, & Storms, 2014). In addition to these large-scale databases, collections of AoA ratings for over 1000 words can be found for French (Ferrand, Bonin, Méot, Augustinova, New, Pallier, & Brysbaert, 2008), German (Birchenough, Davies, & Connelly, 2017), Italian (Montefinese, Vinson, Vigliocco, & Ambrosini, 2019), Spanish (Davies, Izura, Socas, & Dominguez, 2016), and Portuguese (Cameirão & Vicente, 2010).

We found five studies that had reported AoA ratings for Chinese words at the time of preparing this manuscript, two of which collected ratings for single-character words only. Bai and Chen (2011) reported ratings for 258 single-character words, including nouns and verbs. Liu, Shu, and Li (2007) collected ratings for 2423 single-character words, including nouns, verbs, and adjectives.

The other three studies employed a picture-naming task (Snodgrass & Vanderwart, 1980). First, Chen, You, and Zhou (2007) provided AoA ratings for 56 object names using pictures from Snodgrass and Vanderwart and a seven-point scale with 1–6 representing six two-year age bands (e.g., 1 = 0–2 years old and 6 = 11–12 years old) and 7 representing 13 years or older. The dominant names of these pictures were either single-character words or two-character words. In addition, the study collected ratings for 180 single-character Chinese words on the same seven-point scale. Second, Weekes, Shu, Hao, Liu, and Tan (2007) collected AoA ratings for 232 object names using the same pictorial materials from Snodgrass and Vanderwart. The length of these names ranged from one to four Chinese characters. Lastly, Liu, Hao, Li, and Shu (2011) collected both objective and subjective AoA ratings for 435 object pictures whose dominant names ranged from one to four characters. The objective ratings were based on the performance of 422 children on a picture-naming task, and the subjective ratings were based on 48 university students’ AoA estimates about these pictures’ dominant names on a seven-point scale. In addition, they collected naming latency, name agreement, and other norms based on the children’s performance, and concept familiarity, word frequency, and image agreement based on the university students’ assessment.

These collections of ratings provide important psycholinguistic information about Chinese words, but are limited in scale relative to resources available in other languages. This leaves researchers to collect their own AoA ratings for whatever is needed in their individual research projects (e.g., Bai, Ma, Dunlap, & Chen, 2013; Chen, Dent, You, & Wu, 2009; Crepaldi, Che, Su, & Luzzatti, 2012; Law & Yeung, 2010; Xu, Kang, & Guo, 2016; Yu, Law, Han, Zhu, & Bi, 2011) and certainly weakens comparability across studies. To fill the void, the present study reports AoA ratings for 19,716 simplified Chinese words. As indicated by our brief review of recent literature, we believe these ratings will benefit researchers in many different fields, such as psycholinguistics, clinical psychology, natural language processing and computational modeling, and developmental and evolutionary linguistics.

Method

Participants

We collected AoA ratings from a total of 2253 participants through online questionnaires. All but six, who were excluded from data analysis, were native speakers of Mandarin Chinese, and spent most of the first seven years of their lives in mainland China. They received monetary compensation for completing the study.

Based on data screening criteria (see “Results” section), we had to remove 482 participants mainly due to a large number of invalid ratings. There were therefore 1765 participants (52.6% women) that remained. Their ages ranged from 14 to 54 years (mean = 30.5, SD = 7.4), with five participants failing to provide this information. Education level ranged from elementary school to graduate school, with 95.6% college-level or above college-level education. Figure 1a and b present frequency data for age and education level, respectively, of the 1765 participants. In contrast to past rating studies, we were able to recruit raters from a wide range of geographical regions. Figure 1c shows the geographical distribution of the participants.

Fig. 1
figure 1

a Age distribution of the participants. b Education level distribution of the participants. c Geographical distribution of the participants

Word sample

The word sample contained words from two sources. The first source was eight yearly reports, Language Situation in China, published by the National Language Commission (NLC), Ministry of Education, PRC (2011–2018). The reports compiled words that appeared in China’s official news outlets, including newspapers, radio and TV broadcasting programs, and news websites. There were in total 16,221 two-character words in common across the eight yearly reports. We applied CorpusWordParser (www.corpus.org; Xiao, 2014) to help determine the part of speech for each of these words. Based on this information, we retrieved nouns, verbs, and adjectives from the 16,221 words. We found that some adverbs were mistakenly labeled by CorpusWordParser as adjectives, e.g., 极度 “to the upmost” and 顶多 “at most,” and we manually removed them. The second source was the MEga study of Lexical Decision in Simplified CHinese (MELD-SCH; Tsang et al., 2018). For the purpose of their study, Tsang et al. (2018) retrieved 20,000 one- to four-character words from the SUBTLEX-CH corpus (Cai & Brysbaert, 2010). After removing proper nouns, their sample contained 12,578 words. Through a close examination, we found that many of the one-character words (n = 1020) were rare or nonwords. Specifically, there were some extremely uncommon characters (e.g., 蜱, 胍, and 潞), characters that are typically used in combination with other characters to form a word (e.g., 蒟, 珐, and 噌), and characters that seem to only appear in Classical Chinese (e.g., 炷, 孛, and 乜) or in translations of foreign names for people and places (e.g., 圭, 堺, and 樋). Further, as is well known, lexical ambiguity is a pervasive issue with Chinese characters (e.g., Liu et al., 2007), which would have a negative impact on rating reliability and requires extensive research. We therefore dedicated this study to two- and multi-character words and excluded the one-character words from our word sample.

Finally, we also removed the following from the candidate words of the two different sources: (1) words with r-ending retroflexion (e.g., 托儿 “accomplice”) where the second character儿 functions as a phonological suffix to a word and is usually not written out (Lv, 1984; Huang & Liao, 2017), (2) words composed of English alphabets (e.g., B超 “ultrasonography”), (3) words denoting numerical concept (e.g., 十五 “fifteen”), (4) abbreviations of words that contain multiple characters (e.g., 高院 for 高级人民法院 “supreme court”), and (5) words unknown to five or more participants in another rating study conducted in our lab (Xu & Li, 2020). In the end, there were 14,400 words from the NLC yearly reports (2011–2018), and 11,328 words from the MELD-SCH (Tsang et al., 2018), with 6012 words in common. The final word sample thus contained a total of 19,716 words, including 18,180 two-character words, 949 three-character words, and 587 four-character words.

Rating collection

The words were divided into two batches, and AoA ratings were collected following the same procedure. The first batch of 14,400 two-character words, sampled from the NLC reports (2011–2018), were evenly divided into 60 lists of 240 words, roughly matching on word frequency retrieved from the NLC reports (2011–2018). The second batch of the remaining 5316 words, sampled from the MELD-SCH (Tsang et al., 2018), were divided into 24 lists of 221–222 words, roughly matching on word frequency retrieved from SUBTLEX-CH (Cai & Brysbaert, 2010). To inspect the reliability of the ratings, we included in each of these 24 word lists 20 additional words sampled from the first batch. Care was taken in this process to ensure that no extremely high-frequency words or extremely uncommon words were included in these 20 words, because they might lead to an overestimate of rating consistency between the two batches. Therefore, there were a total of 241–242 words on each list of the second batch. Each word list was randomly assigned to 25–35 participants, along with demographic questions attached to the end of the list. The words were presented to the participant in random order. There was a box to the right of each word for the participant to enter an estimated AoA. The demographic questions asked the participant to report gender, age, education level, first language, and the place that they spent most of the first seven years of their lives (Kuperman et al., 2012).

To collect AoA ratings, we adopted the instructions of the Bristol norms (Stadthagen-Gonzalez & Davis, 2006) and Kuperman et al. (2012). For each word, participants were asked to recall and enter the age (in years) at which they had learned it. In keeping with these previous studies, we further specified to the participants that “having learned a word” meant they could understand the word if they heard it at the age they entered, even if they had not used, read, or written it at the time.

Some previous studies utilized a Likert scale to collect AoA norms (e.g., Chen et al., 2007; Weekes et al., 2007). The primary reason that we collected estimates of AoA instead of responses on a Likert scale was to increase comparability across studies conducted both within the same language (in this case, Mandarin Chinese) and even between different languages. That is, “seven years old” would mean “seven years old” in different studies, for different persons, and across different cultures. In contrast, the number “7” on a Likert scale could be labeled and interpreted differently across studies, persons, and cultures. In addition, citing Ghyselinck, de More, and Brysbaert (2000), Kuperman et al. (2012) pointed out the shortcomings of the Likert scale approach in AoA collection, including restricted range and difficulty in using the scale. However, despite differences between the two approaches, Kuperman et al. reported a rather strong correlation (r = .93) for ratings collected though estimates of AoA and responses to a Likert scale.

We required the participants to provide a rating for every word on the list. However, they were free to withdraw at any time. If they encountered a word that they did not know, they were instructed to enter the number “0.” As in previous studies, this was to increase the chance that we collected faithful ratings. It took participants on average 19 minutes to complete the word list and the demographic questions at the end.

Results

Data screening

We first excluded participants who were 55 years old or older, for the following reasons. First, from the mid-1950s to the mid-1960s, the Chinese language underwent a series of reforms, including orthographic simplification and spelling transformation (Honorof & Feldman, 2006; State Council, PRC, 1956; The Committee on Script Reform, PRC, 1964). Second, since the late 1970s, China has been experiencing rapid economic, social, and cultural changes. Therefore, the older generations likely had a different trajectory of vocabulary development relative to the younger generations. In addition, Kuperman et al. (2012) reported that older adults assigned higher AoA ratings than did younger adults, possibly due to a broader age range to choose from. Also taking into consideration that, practically, only a limited number of participants 55 or older (2.0% of 2253 participants) completed the study, we excluded these participants in order to gain a greater level of homogeneity of ratings and provide more reliable AoA norms. We therefore would like to remind colleagues in the field of geriatric research, e.g., semantic dementia, to use the AoA ratings reported in this study with caution. Similarly, we encourage developmental researchers of young children to use performance-based AoA assessments (e.g., Liu et al., 2011), as the current ratings were collected from a sample consisting mostly of adults (99.3% aged 18 or above).

With regard to educational background, we included raters of all different education levels including those who completed only elementary school (n = 2), middle school (n = 10), and high school (n = 66). Although vocabulary development benefits largely from formal education, it is not the only way for someone to learn words. In addition to formal schooling, news and entertainment are probably two main sources for most people to either acquire or consolidate vocabulary knowledge. Words included in the current word sample were either from news outlets (NLC reports) or from films and TV programs (SUBTLEX-CH). We therefore considered it reasonable to include participants without a “full” formal education experience.

We then marked out each participant’s ratings, if any, that were greater than the participant’s age as invalid ratings. Kuperman et al. (2012) trimmed ratings exceeding 25 years of age as outliers. For each participant, we also marked out ratings greater than 25. After inspection, we found that participants with an excessive number of ratings exceeding 25 were indeed unusual. In fact, an excessive number of ratings exceeding the participant’s age, excessive number of ratings exceeding 25, and excessive number of ratings of “0” (representing “I don’t know the word”) often co-occurred, suggesting inattentiveness of these participants to the task. Therefore, we excluded participants for whom more than 15% (36 out of 240–242 words) of ratings exceeded their age, were greater than 25, or were ratings of “0.” This resulted in the exclusion of 248 participants, 11.0% of the total of 2253 participants.

Next, we implemented three additional exclusion criteria. First, we examined each participant’s frequency distribution. A participant was excluded if his or her ratings were distributed entirely or almost entirely between 0 and 9 (single-digit numbers). In a pilot study, we tested the rating instruction and procedure in a lab where we observed participants rate a smaller set of word stimuli. Some participants entered mostly single-digit responses as a way of conveniently rushing through the process, as they were required to provide a response to each word. We therefore took it as a sign of noncompliance with task instructions. Second, a participant was excluded if the majority of his or her ratings contained only a restricted and implausible age range, e.g., 17–20. Lastly, a participant was excluded if his or her ratings correlated negatively with the rest of the participants assigned to the same word list. These criteria resulted in the exclusion of a further 190 participants, 8.4% of the total of 2253 participants. The resulting data set thus contained 1765 participants.

Finally, before we counted the number of valid ratings and computed mean AoA and standard deviation for each word, we removed all “0” ratings (n = 4233) and ratings greater than the corresponding participant’s age (n = 713), which in total accounted for 1.2% of the ratings from the 1765 participants. As a result, the total number of valid ratings was 408,375, and the number of valid ratings for each word ranged from 10 to 28. Words with 15 or more valid ratings accounted for 99.7% of the entire list of 19,716 words. Only 68 words (.3%) had 14 or fewer valid ratings. We did not remove these words from the list, but provided the number of valid ratings associated with each word in the database so that future researchers could use the ratings at their own discretion. In addition, because some words had a limited number of raters, and the raters varied in age and education level, to help users gauge the representativeness of the raters for each word, the demographics of the raters for each word, including gender makeup, range of age, and range of education level, are also provided in the database.

In the end, we evaluated raters’ level of agreement for each of the 84 word lists. The values of Cronbach’s alphas ranged from .82 to .94, with a mean of .89 (SD = .03). Following Kuperman et al.’s (2012) approach, we also computed an estimate of split-half reliability. Correlation between ratings of odd- and even-numbered participants was .69, which yielded an estimate of split-half reliability of .82. Finally, we examined rating reliability between the two batches. The 20 words retrieved from the first batch were embedded in each of the 24 word lists of the second batch. Correlations of the ratings for these 20 words from each of the 24 lists and ratings from the first batch ranged from .92 to .97, with a mean of .95 (SD = .02). (As these ratings were highly correlated, we reported in the database the mean ratings computed based on ratings from the first batch as the AoA norms for these 20 words.)

Data analysis

In this section, we first examined the correlation between AoA ratings in the present study and AoA ratings from a past study (Liu et al., 2011) in order to evaluate the validity of the current ratings. Then, we largely modeled after Kuperman et al.’s (2012) data analysis approach in order to compare the characteristics and the effects of AoA between the Chinese language and the English language.

Correlation between the present and past AoA ratings

There were 271 words in common between words on our list and words included in Liu et al. (2011). As indicated earlier, Liu et al. provided both subjective AoA ratings from adults and objective AoA assessments based on children’s picture-naming performance. Correlation analysis showed that their adults’ AoA ratings were highly correlated with the mean AoA ratings of the present study, r (271) = .742, p < .000001 (Fig. 2a). Their children’s naming performance was moderately correlated with the mean AoA ratings of this study, r (271) = .432, p < .000001 (Fig. 2b), which was comparable to the level of correlation between their adults’ ratings and children’s naming performance for this set of common words, r (271) = .431, p < .000001 (Fig. 2c).

Fig. 2
figure 2

a Scatterplot of mean AoA ratings from the present study and adults’ AoA ratings from Liu et al. (2011). b Scatterplot of mean AoA ratings from the present study and children’s AoA from Liu et al. (2011). c Scatterplot of adults’ AoA ratings and children’s AoA from Liu et al. (2011)

The relatively lower levels of correlation between adults’ ratings and children’s performance should be attributable to task difference: subjective rating task versus objective performance-based assessment (Brysbaert, 2017). On the one hand, the difference might be the result of different criteria for acquisition. That is, picture naming tested children’s ability in word production, whereas AoA rating by adults was a retrospective self-evaluation of word comprehension. Instructions for AoA rating tasks, of course, could specify to adult participants that they need to enter the age at which they first produced a word, but this not only seems to be an extremely difficult task, but also underestimates one’s level of vocabulary knowledge. One may agree that there are words in our mental lexicon that we can understand but never seem to have used, which is probably particularly the case for words outside of our specialty areas. In addition, as can be seen in Fig. 2b and c, there were many words that were not reliably produced until the age of 10 according to children’s naming performance, the meaning of which however appeared to have been learned well before the age of 10, even as young as 6 (“3” in Liu et al., 2011), according to adults’ ratings.

On the other hand, the AoA rating task by adults suffers a limitation with regard to words acquired in early childhood. Infantile or childhood amnesia refers to the phenomenon wherein adults have little or spotty memory of the first few years of their lives, especially 0–3 years (for a more recent review, see Alberini & Travaglia, 2017). This inability to recall early experiences is clearly reflected in AoA ratings collected from adult participants. Specifically, Fig. 2b and c show that, of the 271 words, no or few words received a mean AoA rating of 4 (“2” in Liu et al., 2011) or earlier by adult raters, in contrast to the performance-based assessment, which indicated that children were capable of producing a decent portion of these words before the age of 4 or even the age of 3. Because of this limitation of adults’ AoA ratings, we reiterate our recommendation that research on children’s language development utilize the performance-based AoA assessment approach rather than adults’ AoA ratings.

Demographic variables and AoA ratings

The following analysis examines the relationship between demographic factors and AoA ratings. We focused our analysis on gender and age. The reason to exclude education level from this analysis was that the majority of the participants had college-level education (85.9%). Only 4.4% had below college-level education and 9.6% had beyond college-level education (Fig. 1b).

The analysis showed that women assigned slightly earlier AoA ratings than men (Table 1). The difference was significant, t(19715) = −8.55; p < .000001. Among all participants, 48.1% were below the age of 30, whereas 40.9% reported age of 30–39. The remaining 10.9% reported age of 40 or above. Considering that some of the 84 word lists did not receive ratings from participants aged 40 or above, we compared the first two age groups to explore potential age group differences in AoA ratings. Younger participants (< 30 years old) assigned earlier AoA ratings than older participants (30–39 years old), t(14399) = −49.93, p < .000001 (Table 1). This is consistent with the report by Kuperman et al. (2012) that older participants tended to assess greater AoA ratings than younger participants possibly due to a greater age range to choose from.

Table 1. Means and standard deviations (SD) of AoA ratings by two gender and two age groups

Past research has shown that word frequency and AoA share an intriguing relation (e.g., Baayen, Milin, & Ramscar, 2016; Brysbaert, 2017; Cortese & Schock, 2013; Zevin & Seidenberg, 2004). On the one hand, they are correlated, with early-learned words being high in frequency relative to late-learned words. On the other hand, they are two constructs of different origins and appear to tap into different underlying mechanisms of word processing. With the large number of words on our list, we examined the correlations of AoA ratings and the SUBTLEX-CH word frequency (Cai & Brysbaert, 2010) for the two gender groups and the two age groups, respectively, in order to examine how demographic factors might moderate the relation of AoA and word frequency.

SUBTLEX-CH provided word frequency for 18,569 of the 19,716 words on our list. The following analysis was based on logarithmic transformation of SUBTLEX-CH frequency for 18,569 words. Fisher’s r-to-z transformation was performed to compare the magnitudes of correlation coefficients. Between the two gender groups, women’s AoA (r = −.369, p < .000001) had a stronger correlation with word frequency relative to men’s AoA (r = −.326, p < .000001), z = 4.71, p < .000001. As for age, the younger group’s AoA (r = −.364, p < .000001) had a stronger correlation with word frequency relative to older group’s AoA (r = −.324, p < .000001), z = 4.37, p < .000001. That is, although the negative correlation of AoA and word frequency were robust, both gender and age appear to moderate the strength of the correlation.

Vocabulary growth

Following Kuperman et al.’s (2012) approach, we generated vocabulary growth curve based on AoA ratings (Fig. 3a). First, we classified words into yearly bins according to their mean AoA ratings. Then, we calculated percentages and cumulative percentages of acquired words for each yearly bin.

Fig. 3
figure 3

a Vocabulary growth curve generated according to mean AoA ratings. b Vocabulary growth curves generated according to mean AoA ratings for the two gender groups. c Vocabulary growth curves generated according to mean AoA ratings for the two age groups

To continue investigating gender and age effects, we also plotted growth curve for each of the two gender groups and each of the two age groups (Fig. 3b and c). Between two gender groups, women seemed to have acquired more words than men before and during the years of elementary school education (Table 2). This is consistent with past reports that girls outperformed boys in vocabulary knowledge and other language skills at the early ages (Etchell, Adhikari, Weinberg, Choo, Garnett, Chow, & Chang, 2018; Lange, Euler, & Zaretsky, 2016; Lung, Chiang, Lin, Feng, Chen, & Shu, 2011; Stangeland, Lundetræ, & Reikerås, 2018; Toivainen, Papageorgiou, Tosto, & Kovas, 2017), and that boys appeared to catch up after age 10 (Lange et al., 2016; Lung et al., 2011; Toivainen et al., 2017).

Table 2. Growth curve differences (%) between two gender groups and two age groups

The younger group (< 30) appeared to have acquired more words than the older group (30 – 39) during the years of elementary and middle school education (Table 2). There are two possible explanations for this difference. First, roughly speaking, the younger group were born in or after the 1990s whereas the older group before the 1990s. The younger group’s advantage in vocabulary growth during school years could be a reflection of social, economic, and cultural developments in China where formal education had been increasingly emphasized. That is, this difference could be a cohort effect. Second, it could be due to demand characteristics, as Kuperman et al. (2012) speculated that participants who were older had a wider age range to choose from when assessing the AoA. More realistically, these two factors, cohort and age, might have had a cumulative effect (Hobcraft, Menken, & Preston, 1982). Given its prominence (e.g., 13.6% difference at age 12), this difference is worthy of further investigation.

AoA and lexical decision

As extensively discussed by Kuperman et al. (2012), in addition to the word frequency effect, AoA accounts for variance in performance on lexical decision tasks as measured by both reaction time and error rate. Words acquired earlier in life tend to elicit responses more rapidly and with a greater level of accuracy. To verify this AoA effect with our ratings, we retrieved reaction time and error rate from the MELD-SCH by Tsang et al. (2018), who collected normative lexical decision data for simplified Chinese words. As indicated in the “Method” section, our word sample included a total of 11,328 words from the MELD-SCH. We found that in the MELD-SCH, the second character of the word 魁梧 was mistaken with another character, and we therefore excluded the word from the following analysis.

In keeping with previous studies (Brysbaert & New, 2009; Kuperman et al., 2012; Tsang et al., 2018), we took standardized reaction time (zRT) as the measure for reaction speed, and error rate as the measure for accuracy. We first verified the correlations of AoA ratings with zRT and error rate (Table 3). AoA ratings were positively correlated with both measures. That is, for words learned later in life, participants’ reaction times were longer, and they made more errors. Table 3 also presents other variables examined in the lexical decision mega-study by Tsang et al. (2018), which was shown to be predictive of performance on lexical decision tasks. Among these, Tsang et al. reported that contextual diversity (Cai & Brysbaert, 2010), taken as an alternative measure for word frequency, accounted for a substantial portion of variance in both zRT and error rate relative to other variables. In order to be able to directly compare our data with Tsang et al.’s model on lexical processing of Chinese words, we entered contextual diversity into the following analyses in place of word frequency. We, however, also ran the same analyses with SUBTLEX-CH word frequency, which yielded essentially the same outcomes due to the almost perfect correlation between these two variables in our word sample, r (11327) = .99.

Table 3. Correlations (N = 11,327) of lexical decision measures (zRT and error rate), AoA, and predictor variables retrieved from past studies (Cai & Brysbaert, 2010; Tsang et al., 2018)

Table 3 shows that, compared to contextual diversity (Log10_CD), AoA ratings had weaker correlations with both zRT and error rate. For illustration, Figs. 4 and 5 depict the correlations of zRT with AoA and Log10_CD, respectively.

Fig. 4
figure 4

Scatterplot of response time (zRT) and AoA ratings

Fig. 5
figure 5

Scatterplot of response time (zRT) and contextual diversity (Log10_CD)

We then determined whether AoA ratings could account for extra variance in zRT and error rate above and beyond other variables including frequency as indicated in past research (Kuperman et al., 2012 with English; Tsang et al., 2018 with Mandarin Chinese). Kuperman et al. (2012) reported that, after the introduction of word frequency, number of letters, number of syllables, and similarity to other words, AoA ratings accounted for an additional 3.8% of variance in zRT and an additional 9.8% of variance in error in the lexical decision task with English words. Their models explained 65.3% of variance in zRT and 43.3% in error rate. To test this effect of AoA on the processing of Chinese words, we first entered four predictors into our regression model. These were contextual diversity as the measure for word frequency, number of strokes as the measure for visual complexity, number of characters in place of number of syllables, as there is a one-to-one correspondence between character and syllable in Mandarin Chinese, and the total number of words that can be formed by each of the constituent characters as the measure for similarity to other words.

The regression analysis with a forward method (Table 4) revealed that contextual diversity (Log10_CD), number of strokes (Stroke), and number of characters (Character) were significant in predicting zRT, while Log10_CD, Character, and number of words formed (Log10_NWF) were significant in predicting error rate. Among these predictors, Log10_CD emerged as the most prominent predictor, and the model accounted for 38.7% of variance in zRT and 19.2% in error rate. We then included AoA ratings in the model, which explained an additional 3.5% of variance in zRT, comparable to the results of Kuperman et al. However, in terms of error rate, AoA ratings only explained an additional 2.5% of variance.

Table 4. Significant predictors for zRT and error rate based on Kuperman et al.’s model (2012)

In their lexical decision mega-study, Tsang et al. (2018) included cumulative frequency (i.e., frequency of all words that contain one of the constituent characters), number of meanings (i.e., number of meanings that the constituent characters represent), and number of pronunciations (i.e., number of pronunciations that the constituent characters have), in addition to the predictors in the above model (i.e., contextual diversity, number of strokes, number of characters, and number of words formed). Tsang et al.’s model accounted for approximately 40% of variance in zRT and 20% in error rate. We tested the AoA effect with this model as well.

The model with a forward method that contained these additional predictors yielded similar outcomes as the one without, accounting for 39.4% of variance in zRT and 19.7% of variance in error rate. The introduction of AoA again explained an additional 3.4% of variance in zRT and 2.4% in error rate. Table 5 presents the outcomes of the final models. By comparison with the outcomes presented in Table 4, Table 5 shows that the effect of AoA on the efficiency of lexical decisions appeared to be relatively independent of these additional predictors. That is, the introduction of these additional predictors did not seem to impinge on the unique effect of AoA.

Table 5. Significant predictors for zRT and error rate based on Tsang et al.’s model (2018)

As pointed out by some researchers (e.g., Cortese & Schock, 2013), the predictive power of word frequency seems to be limited when it comes to the processing of low-frequency words, which might explain in the above analyses the additional variance accounted for by AoA above and beyond contextual diversity. We therefore performed a tertiary split based on word frequency, and implemented the same analyses above for each of these three categories of words: low-frequency words (Log10 SUBTLEX-CH frequency < 1.431; n = 3848), medium-frequency words (Log10 SUBTLEX-CH frequency = [1.431, 2.164); n = 3698), and high-frequency words (Log10 SUBTLEX-CH frequency > 2.164; n = 3781). Note: we report these cutoff values for trichotomy in terms of SUBTLEX-CH, the standard measure for word frequency, for easy reference by future researchers.

First, correlation analyses indicated that AoA showed stronger correlations with both response time (zRT) and error rate (Error) for low-frequency words relative to high-frequency words. The correlation of contextual diversity (Log10_CD) with response time showed the opposite pattern, i.e., a weaker correlation for low-frequency words relative to high-frequency words. On the other hand, similar to AoA, the correlation between contextual diversity and error rate decreased from low-frequency words to high-frequency words (Table 6).

Table 6. Correlations of zRT and error rate with AoA and contextual diversity (Log10_CD) for low-, medium-, and high-frequency words

We then implemented Kuperman et al.’s (2012) model at each of the three frequency levels with zRT and error rate as the criterion variables. The regression analysis with a forward method revealed that AoA emerged as the most important predictor of both speed and accuracy for the processing of low-frequency words, whereas contextual diversity was most predictive of both speed and accuracy for the processing of high-frequency words (Table 7).

Table 7. Significant predictors for zRT and error rate selected based on Kuperman et al.’s model (2012) for low-, medium-, and high-frequency words

In summary, we carried out the same analyses on the entire word sample and then on words of high, medium, and low frequency. The analyses on the entire word sample indicated that contextual diversity, as a measure for word frequency, had the most predictive power for speed and accuracy of lexical decisions. However, the outcomes presented in Tables 6 and 7 appear to highlight the different roles of AoA and contextual diversity in predicting the efficiency of lexical decisions. Specifically, in terms of speed, when word frequency decreased, the predictive power of contextual diversity decreased, whereas the predictive power of AoA increased. However, in terms of accuracy, when word frequency decreased, the contribution of both contextual diversity and AoA to the model gradually increased. On the one hand, this could be partially attributed to a ceiling effect with high-frequency words wherein the accuracy rate was extremely high. On the other hand, the contrasts in outcomes between low-frequency and medium-frequency words seem to suggest the possibility that other factors (e.g., semantic ambiguity) came into play in determining error rate when word frequency increased. Finally, the fact that both AoA and contextual diversity exerted a significant influence on speed and accuracy with the entire sample and at all three frequency levels attests to the robustness of their effects.

Discussion

This study reports AoA ratings for 19,716 simplified Chinese words. The validity of the ratings was verified through correlations with prior ratings for a smaller collection of simplified Chinese words (Liu et al., 2011). Demographic factors, i.e., gender and age, appeared to affect the ratings, whereby women and younger participants were associated with earlier AoA relative to men and older participants, respectively. The two gender groups and the two age groups also differed in their respective trajectory of vocabulary development, with women and younger participants having acquired more words during the early years of formal education relative to men and older participants, respectively. Further, gender and age also appeared to moderate the association between AoA and word frequency, which has been a point of debate among researchers (e.g., Brysbaert, 2017; Zevin & Seidenberg, 2004). Finally, AoA ratings accounted for an additional portion of variance in word recognition, above and beyond word frequency, as measured by response time and error rate. Follow-up analysis on words of different levels of frequency suggested that the contribution of AoA lies largely in the processing of low-frequency words.

In the process of verifying the validity of our ratings, we found that adults’ AoA ratings were moderately correlated (approximately .43) with AoA assessments based on children’s naming performance. As discussed earlier, this could be due to different task requirements. AoA ratings collected from adults were based on self-assessment of word comprehension, whereas AoA assessments collected from children were based on word production. We suspect that this comprehension versus production discrepancy might be especially prominent in Chinese relative to in English. The Chinese lexicon contains one-character, two-character, and multi-character words. Often, a same referent can be denoted by either a one-character or two-character word, and this appears especially the case for early-acquired concepts. For example, both 牙 and 牙齿 mean “tooth,” and both 跑 and 奔跑 mean “run.” That is, it is reasonable to expect that children first learn to produce single-character words before two-character words to name objects and actions. Even after having learned both, they are probably more likely to produce single-character words instead of their two-character counterparts. Brysbaert and Biemiller (2017) found that AoA ratings and test-based AoA assessments were highly correlated for English words, r (18,139) = .757, which is in line with the above speculation that the comprehension versus production discrepancy may be less pronounced in English. However, our analysis was based on a limited word sample due to the lack of large scale test-based AoA collections. Future research needs to further investigate this potential cross-language difference.

The gender and age differences revealed in this study may have important practical and theoretical implications. For example, the difference in growth curve seems to provide empirical evidence for gender-specific pathways of vocabulary development. This is important evidence, especially when we consider that the difference was revealed by men’s and women’s retrospective self-assessment. In addition, the finding that gender moderated the association of word frequency and AoA suggests that the amount of word exposure might have different effects on men versus women. Lastly, the prominence of the growth curve difference between two age groups highlights the need to take into consideration not only subject variables that capture individual differences but also socioeconomic variables that represent societal and cultural changes when we conduct research on language development.

With regard to lexical processing, the role of AoA ratings in word recognition was found to be consistent between Chinese and English words. That is, it explained an amount of variance in lexical processing in addition to the most prominent predictor, word frequency. Further, our analysis suggests that the effects of AoA in lexical processing lie mainly in the processing of low-frequency words. This finding appears to echo the effects of word prevalence (i.e., the extent to which a word is known or used in the population) and word frequency as reported by Keuleers, Stevens, Mandera, and Brysbaert (2015), wherein word prevalence was a better measure for uncommon words and predictive of speed of visual word recognition, and word frequency a better measure for well-known words. Intuitively, AoA and word prevalence may be associated with each another in that a word that is acquired early by many should be known and used by many. On the other hand, there seem to be important distinctions between them. For example, AoA is probably more closely associated with one’s natural language development trajectory, which may be particularly the case for early-learned words. Word prevalence may be more likely to be influenced by socioeconomic and cultural factors. As word prevalence data are currently unavailable in Mandarin Chinese, further investigation is needed to explore the relationship between AoA and word prevalence, and more importantly, the moderating effect of word frequency on the predictive power of AoA and word prevalence in word recognition. Lastly, a noteworthy finding from comparing outcomes of the regression models indicated that frequency, word complexity, number of syllables or characters, similarity to other words, and AoA ratings explained a much smaller amount of variance in Chinese word recognition (slightly over 42% in response time and 22% in error rate) relative to English word recognition (over 60% in response time and over 40% in error rate, Kuperman et al., 2012). In addition, the percentages of our regression models were consistent with those revealed by Tsang et al. (2018) conducted with Chinese words, even though Tsang et al.’s analysis was conducted with a greater number of words, including single-character words. Although these differences in outcomes could be partially due to smaller word samples in Tsang et al. and the present study relative to that of Kuperman et al, given the prominent differences between the Chinese language and the English language, more research seems necessary to explore the unique factors and construct more efficient models of Chinese lexical processing.

There are a number of limitations with regard to the ratings provided in this study, which we hope to address in the future. First, as indicated before, we excluded raters aged 55 and older, and thus caution colleagues in the field of geriatric research to use these ratings carefully. Second, two-character words account for the largest portion of the Chinese lexicon. For example, the LCMC (Lancaster Corpus of Mandarin Chinese; McEnery & Xiao, 2004) and the ToRCH (Text of Recent CHinese; Xu, 2017) each contain about 70% two-character words. Our word sample accordingly consisted of largely two-character words, which seem to be the focus of many psycholinguistic studies (e.g., Du, Zhang, & Zhang, 2014; Wang et al., 2016, Wang et al., 2017; Zhang et al., 2004). Although we included all three- and four-character words in MELD-SCH, a more comprehensive word sample including characters or single-character words would be beneficial to a wider range of research areas, e.g., natural language processing. Third, we intended to recruit raters of varying age and educational background. However, participants were mostly adults, and as is common to most online studies, college graduates appeared to be overrepresented. As cautioned earlier, due to infantile or childhood amnesia, adults’ ratings tend to underestimate vocabulary knowledge acquired early in life. In addition, as pointed out by Kuperman et al. and corroborated by our analysis, demand characteristics of the adult rating task could lead to underestimation of AoA by older adults simply because of a wider age range to choose from. The accuracy of adults’ AoA ratings could also be impacted by memory decay and difficulty differentiating comprehension and production, particularly for early-acquired words. We therefore recommend AoA ratings collected from young participants, ideally children’s performance-based AoA estimates, to developmental researchers. With regard to education background, these ratings are likely more reflective of the trajectory of vocabulary development as the product of formal education. Researchers who are interested in language development among people without formal schooling should be aware of the potential differences between their target population and the sample of the present study. Finally, the heterogeneity of the sample should be noted. On the one hand, these variabilities allowed us to analyze gender and age differences in ratings for the purpose of informing future research. On the other hand, the number of valid ratings within each demographic group was relatively small. Given the variability in ratings, we recommend only mean ratings computed based on all raters rather than any subgroups of raters.