Research on word recognition has seen an interesting development in the last two decades. Whereas word recognition was previously investigated in small-scale studies, in which some 100 words were divided over a factorial design with a few conditions and evaluated with analysis of variance, the new approach consists of collecting word-processing times for thousands of words and analyzing them with regression analysis, so that variables of interest can be represented continuously rather than categorically. Such studies are often called megastudies. Table 1 gives an overview of the megastudies available.

Table 1 Word processing megastudies published so far, listed in chronological order for the various languages tested (limited to studies with 900 word types or more)

Balota, Yap, Hutchison, and Cortese (2013) and Keuleers and Balota (2015) summarized the advantages of the megastudy approach. First, they listed the disadvantages of the factorial approach. These are:

  • The difficulty of equating the stimuli across conditions.

  • The fact that many words with a shared feature are presented in a short experiment, which may give rise to context effects.

  • The fact that continuous variables are categorized (e.g., divided into high vs. low).

  • The fact that the study is limited to stimuli at the extremes of a word characteristic.

  • The danger of experimenter bias when selecting words for the various conditions.

The disadvantages of the factorial design are less of an issue in the megastudy approach, because the various control variables can be entered in the regression analysis, participants see a random selection of words, continuous variables are not categorized, and there is no prior stimulus selection by the experimenter (for the last aspect, see also Liben-Nowell, Strand, Sharp, Wexler, & Woods, 2019). Additional advantages of the megastudy approach are:

  • More power due to the large number of stimuli.

  • The data can be used multiple times to address new questions.

  • The relative importance of existing word characteristics can be assessed.

  • The impact of a variable can be studied across the entire range.

  • The strength of a new, theoretically important variable can be evaluated; the data can also be used to search for new variables.

  • The quality of newly presented computational models can be evaluated.

  • The quality of competing metrics (e.g., word frequency norms) can be compared.

  • If the megastudy includes many participants in addition to many stimuli, individual differences can be studied.

The new possibilities can be illustrated with the English Lexicon Project (ELP; Balota et al., 2007), consisting of lexical decision and naming times for over 40 thousand English words. In several studies, the dataset has been used to examine the relative importance of word features, such as frequency, length, similarity to other words, part of speech, age of acquisition, valence, arousal, concreteness, and letter bigrams (e.g., Brysbaert & Cortese, 2011; Kuperman, Estes, Brysbaert, & Warriner, 2014; Muncer, Knight, & Adams, 2014; New, Ferrand, Pallier, & Brysbaert, 2006; Schmalz & Mulatti, 2017; Yap & Balota, 2009). It has also been used to test new variables, such as OLD20 (Yarkoni, Balota, & Yap, 2008), the consonant–vowel structure of words (Chetail, Balota, Treiman, & Content, 2015), and word prevalence (Brysbaert, Mandera, McCormick, & Keuleers, 2019). It has been valuable to test mathematical models of word recognition and individual differences (Yap, Balota, Sibley, & Ratcliff, 2012), to understand how compound words are processed (Schmidtke, Kuperman, Gagné, & Spalding, 2016), to study the influence of semantic variables on word recognition (Connell & Lynott, 2014), to find the best frequency measure for English words (Brysbaert & New, 2009; Gimenes & New, 2016; Herdağdelen & Marelli, 2017), to test new computational models (Norris & Kinoshita, 2012), and to predict word learning in speakers of English as a second language (Berger, Crossley, & Kyle, 2019).

To ensure the usefulness of the ELP, it is important to check for converging evidence from other, independent sources. This motivated Keuleers, Lacey, Rastle, and Brysbaert (2012) to compile the British Lexicon Project (BLP), consisting of lexical decisions to 28,000 monosyllabic and disyllabic words. Other interesting additions were the collection of auditory lexical decision times (Goh, Yap, Lau, Ng, & Tan, 2016; Tucker et al., 2019) and semantic decision times (Pexman, Heard, Lloyd, & Yap, 2017).

In the present article, we discuss the development of a new large English database of word-processing times (there are large databases for other languages as well, as can be seen in Table 1). The present database is the result of a crowdsourcing project (Keuleers, Stevens, Mandera, & Brysbaert, 2015) that was not primarily set up to analyze response times. Because previous research showed that the collection of reaction times in a web browser can be accurate enough to be a useful method for behavioral research (Crump, McDonnell, & Gureckis, 2013; Reimers & Stewart, 2015), we will examine to what extent the response times from such a paradigm inform us about the ease of word recognition.

Method

Keuleers and Balota (2015) defined a crowdsourcing study as a study in which data are collected outside of the traditional, controlled laboratory settings. The English Crowdsourcing Project (ECP), which is presented here, is part of a series of internet-based vocabulary tests developed at Ghent University, in which participants have to indicate which of the presented stimuli they know as words. The vocabulary tests were started in 2013 in Dutch (Keuleers et al., 2015). The English test started in 2014 (Brysbaert, Stevens, Mandera, & Keuleers, 2016a) and is still running (available at http://vocabulary.ugent.be/). Its main goal was to get an idea of how well words are known in the population, a variable we call word prevalence (Brysbaert et al., 2019; Brysbaert, Stevens, Mandera, & Keuleers, 2016b; Keuleers et al., 2015).

The exact instructions of the ECP vocabulary test are:

In this test you get 100 letter sequences, some of which are existing English words (American spelling) and some of which are made-up nonwords. Indicate for each letter sequence whether it is a word you know or not. The test takes about 4 min and you can repeat it as often as you want (you will get new letter sequences each time). If you take part, you consent to your data being used for scientific analysis of word knowledge. Do not say yes to words you do not know, because yes-responses to nonwords are penalized heavily!

Per test, participants received 70 words and 30 nonwords. We expected average participants to know about 70% of the presented words, so we corrected for response bias by presenting around one third of the stimuli as nonwords. To discourage guessing, participants were warned that they would be penalized if they responded “word” to nonword stimuli. At the end of the test, participants received an estimate of their vocabulary size, which was a big motivation for them to take part and to recommend the test to others. The presented estimate was computed by subtracting the percentage of word responses to nonwords (false alarms) from the percentage of word responses to words (hits).
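In code, the feedback score reduces to a few lines (a minimal sketch; the function name and the example numbers are ours):

    def vocabulary_estimate(yes_to_words, yes_to_nonwords, n_words=70, n_nonwords=30):
        """Guessing-corrected vocabulary score: percentage hits minus percentage false alarms."""
        hit_rate = yes_to_words / n_words
        false_alarm_rate = yes_to_nonwords / n_nonwords
        return 100 * (hit_rate - false_alarm_rate)

    # Example: 55 yes-responses to the 70 words and 3 to the 30 nonwords
    # give 100 * (55/70 - 3/30) = 68.6% (rounded).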

The yes/no format with guessing correction is an established form of vocabulary testing in the language proficiency literature (Ferré & Brysbaert, 2017; Harrington & Carey, 2009; Lemhöfer & Broersma, 2012; Meara & Buxton, 1987). However, in the ECP the presented words and nonwords were not fixed, as they are in a regular vocabulary test.

The words were selected from a set of 61,851 English words compiled over the years.Footnote 1 These words included the lemmas and high-frequency irregular word forms from the SUBTLEX databases, supplemented with stimuli from dictionaries and spelling checkers. Figure 1 shows the distributions of word length, word frequency, and word prevalence in the stimulus list. Word length varied from 1 to 22 letters. Word frequency is expressed as Zipf scores (Brysbaert, Mandera, & Keuleers, 2018), going from 1.29 (not present in the corpus) to 7.62 (the word "you"). Particularly interesting is the large number of words not observed in the SUBTLEX-US frequency list (or in most other frequency lists) but present in dictionaries and spelling checkers. Many of these are well known, even though they are rarely used in spoken or written language (such as mindfully, rollerblade, submissiveness, toolbar, jumpstart, freefall, touchable, . . . ; see Brysbaert et al., 2019, for more information). Word prevalence ranges from less than −2 (a word unknown to virtually everybody) to over +2.33 (a word known by more than 99% of the population).
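For readers unfamiliar with the Zipf scale, a Zipf value is the base-10 logarithm of a word's frequency per million words plus 3, with Laplace smoothing so that words absent from the corpus still receive a value. The sketch below reproduces the two endpoints mentioned above; the corpus figures assumed for SUBTLEX-US (about 51 million tokens and 74 thousand word types) and the frequency count used for "you" are approximations.

    import math

    def zipf_score(count, corpus_tokens=51_000_000, corpus_types=74_000):
        """Zipf = log10(frequency per million words) + 3, with Laplace smoothing."""
        frequency_per_million = (count + 1) / ((corpus_tokens + corpus_types) / 1_000_000)
        return math.log10(frequency_per_million) + 3

    print(round(zipf_score(0), 2))          # ~1.29: a word not present in the corpus
    print(round(zipf_score(2_130_000), 2))  # ~7.62: a very frequent word such as "you"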

Fig. 1. Overview of the word lengths, word frequencies, and word prevalence values present in the stimulus list

The nonwords were selected from a list of 329,851 pseudowords generated with Wuggy (Keuleers & Brysbaert, 2010). They were constructed to be as similar as possible to the words in terms of length and letter transition probabilities within and across syllables. Because the stimuli presented in the test were not fixed, participants could take the test more than once. Indeed, a few participants took several hundred tests over the years.

Specific to the ECP stimulus set is that the vast majority of words consist of uninflected lemma forms. This is different from the BLP, in which about half of the stimuli were inflected forms (the only inclusion criterion was monosyllabic or disyllabic words), and the ELP, which consisted of all words observed in a corpus, including inflected forms and proper nouns (names of people and places).

Although the ECP task involves a yes/no decision, it is important to consider the differences from a traditional lexical decision task. First, at no point were participants told that time was an issue. Second, participants were explicitly instructed to indicate only which words they knew, and not to guess if they were unfamiliar with a sequence of letters. Third, participants did the test outside of a university setting and did it because they wanted to know their English proficiency level. Still, Harrington and Carey (2009) noticed that under these conditions the response times (RTs) can be informative. Because averaging over large numbers reduces the noise in the individual observations, the worth of the RTs is expected to increase with the number of participants taking part.

Before the start of the test, participants were asked a few basic questions. These were (1) what their native language was, (2) where they grew up, (3) the highest degree they had obtained or were working toward, (4) their gender and age, (5) how many languages they spoke in addition to English and their mother tongue, and (6) how good their knowledge of English was. Participants were not required to provide this information before they could take part, but the vast majority did.

Results and discussion

The data used in the present article are based on all the tests taken between January 2014 and September 2018. During that period we collected more than 142 million answers from 1.42 million experimental sessions.

For the analyses of the present article, we used the following data-pruning pipeline (decided on entirely before looking at the data; nothing was changed as a result of the analyses).Footnote 2 A minimal code sketch of the pipeline is given after the list.

  1. We only took into account the word data. This reduced the dataset from 142 million to 99.5 million observations.

  2. We only used the first three sessions from each IP address, to make sure that no individual had an undue influence (some participants did hundreds of sessions). This reduced the dataset to 93.6 million observations.

  3. We deleted the first nine trials of each session, which were considered training trials, leaving us with 84.3 million observations.

  4. RTs longer than 8,000 ms were deleted, so that no dictionary consultation could take place. This reduced the dataset to 83.5 million observations.

  5. Outliers were filtered out on the basis of an adjusted boxplot method for positively skewed distributions (Hubert & Vandervieren, 2008), calculated separately for the words in each individual session, leaving 79.0 million observations.

  6. Sessions with more yes-responses to nonwords than to words were omitted (often people pressing the wrong buttons), further reducing the dataset to 78.7 million data points.

  7. Finally, only data from users with English as their native language who answered the person-related questions were retained. This reduced the final dataset to 41.2 million observations coming from almost 700 thousand sessions.
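The following sketch shows how such a pipeline might look in Python with pandas. It is illustrative only: the column names are hypothetical, step 6 requires the nonword trials and is therefore only indicated, and the outlier fences follow the adjusted boxplot of Hubert and Vandervieren (2008).

    import numpy as np
    import pandas as pd
    from statsmodels.stats.stattools import medcouple

    def adjusted_boxplot_fences(rts):
        """Outlier fences of the adjusted boxplot for skewed data (Hubert & Vandervieren, 2008)."""
        q1, q3 = np.percentile(rts, [25, 75])
        iqr = q3 - q1
        mc = float(medcouple(rts))
        if mc >= 0:
            return q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
        return q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr

    def prune(trials: pd.DataFrame) -> pd.DataFrame:
        # 6) in practice, sessions with more yes-responses to nonwords than to words
        #    are removed first, because that step needs the nonword trials
        # 1) keep the word trials only
        trials = trials[trials["stimulus_type"] == "word"]
        # 2) keep the first three sessions from each IP address (assumes chronological row order)
        first_sessions = trials.drop_duplicates("session").groupby("ip")["session"].head(3)
        trials = trials[trials["session"].isin(first_sessions)]
        # 3) drop the first nine (training) trials of each session
        trials = trials[trials["trial_number"] > 9]
        # 4) drop RTs longer than 8,000 ms
        trials = trials[trials["rt"] <= 8000]
        # 5) adjusted-boxplot outlier filter, computed separately per session
        keep = trials.groupby("session")["rt"].transform(
            lambda r: r.between(*adjusted_boxplot_fences(r.to_numpy())))
        trials = trials[keep]
        # 7) keep native speakers of English who answered the demographic questions
        return trials[trials["native_english"] & trials["answered_questions"]]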

For 47% of the sessions, responses were collected from a device with a touchscreen; in the other sessions, responses were given on a keyboard. In the touch interface, responses were made using virtual YES and NO buttons; in the keyboard interface, the F key was used for the “no” response, and the J key for the “yes” response.Footnote 3

About 60% of the participants grew up in the United States, 22% in the United Kingdom, and the remaining 18% in other countries. All words had American spellings (e.g., labor, center, analyze).Footnote 4

Per word, there were on average 666 observations in the resulting subset of the data, ranging from a minimum of 190 to a maximum of 7,895. The reasons for this variation are twofold. First, we received feedback from users that our initial list contained too many nonexistent adverbs (lucklessly, felinely) and nonexistent nouns ending in -ness (gingerliness, gelatinousness). These were pruned, together with some other letter sequences that created confusion, such as compound words written as a single word (clairsentience, taylormade) and the letters of the alphabet. At that time we also entered new words we had come across since the start of the project, which explains why the minimum number of responses is only 190. The high maximum number of responses was due to two occasions on which the randomization algorithm got stuck. As a result, the same sequence was presented repeatedly until we were alerted to the problem. Because of these infelicities, cautious users may want to exclude entries with fewer than 316 observations (N = 2,544) or more than 1,000 observations (N = 140), although we do not think these RTs are problematic, and we did not exclude them from the analyses presented here.
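For users who want to apply this conservative filter, it amounts to a single selection on the master file described in the Availability section (the column name used here is hypothetical):

    # Keep only entries with between 316 and 1,000 observations.
    ecp_conservative = ecp[ecp["n_observations"].between(316, 1000)]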

RTs were calculated on correct trials only and were defined as the time interval between the presentation of the stimulus and the response of the participant. Overall accuracy was .78. The mean RT was 1,297 ms (SD over stimuli = 357), and the mean standard deviation in RTs per stimulus across participants was 784 ms (SD over stimuli = 264). Both values are considerably higher than in laboratory-based megastudies. For comparison, in the lexical decision part of ELP the mean RT for the words was 784 ms (SD = 135), and the mean standard deviation of the LDT latencies was 278 ms (SD = 92; Balota et al., 2007).

Correlations with data from other megastudies

A first way to measure the merit of the RTs in ECP is to correlate them with the RTs from other megastudies. The prime candidate, of course, is ELP, with its lexical decision RTs and naming latencies. Next is the BLP, which also provides lexical decision times. For both databases, we used standardized RTs (zRTs), because they correlate more strongly with word characteristics. There was no need to work with standardized RTs for the ECP, since the correlation between raw RTs and zRTs was r = .992. The reasons for this high correlation are the large number of observations per word (several hundred, as compared to the 30–40 observations per word in ELP) and the fact that each participant contributed only a tiny fraction of the data. Raw RTs are easier to understand, because they are closer to human intuitions and they retain individual differences in RTs (but see below for some analyses with zRTs).
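For completeness, the sketch below shows how zRTs can be computed, assuming a table of pruned correct trials with hypothetical column names. RTs are z-transformed within each session before being averaged per word; the exact standardization unit used by the authors is not spelled out here, so that choice is an assumption.

    import pandas as pd

    def word_level_rts(trials: pd.DataFrame) -> pd.DataFrame:
        """Average raw and session-standardized RTs per word."""
        by_session = trials.groupby("session")["rt"]
        trials = trials.assign(
            zrt=(trials["rt"] - by_session.transform("mean")) / by_session.transform("std"))
        return (trials.groupby("word")[["rt", "zrt"]].mean()
                      .rename(columns={"rt": "mean_rt", "zrt": "mean_zrt"}))

    # In ECP the two word-level measures correlate at r = .992:
    # word_level_rts(trials)[["mean_rt", "mean_zrt"]].corr()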

We also excluded words that had an accuracy of less than .85 in ECP, since the RTs of these words are less trustworthy.Footnote 5 This left us with a total of 12,001 words for which we had RTs in all three databases. Because of the design of the BLP, the observations are limited to monosyllabic and disyllabic words (the words most often used in experimental research). Table 2 gives the correlations between the databases. As can be seen, for this particular dataset, ECP correlates almost as strongly with the ELP lexical decision times (ELPLDT) as the BLP does. This is good news for the value of ECP.

Table 2 Correlations between the RTs of English Crowdsourcing Project (ECP), English Lexicon Project (ELP), and British Lexicon Project (BLP) for the items in common that were generally known (N = 12,001)

A second way to examine the usefulness of the ECP RTs is to see how well they correlate with the RTs from the other studies mentioned in Table 1 and, more importantly, how the correlations compare to those with ELPLDT and BLP. Table 3 lists the findings for some classic datasets.

Table 3 Correlations of the ECP, ELP, and BLP RT data with other datasets

As can be seen in Table 3, the ECP RTs correlated .79 with the standardized ELP lexical decision times, and .73 with the BLP zRTs. These correlations can be considered a lower bound on the reliability of the dataset (based on convergent validity), indicating that some 75%–80% of the variance in the ECP times is systematic variance that can be explained by stimulus characteristics. As for the correlations with the other datasets, ECP seems to perform slightly worse than ELP (in particular for short words) and on par with BLP.

Variance accounted for by word characteristics

A third way to gauge the quality of the ECP dataset is to see how strongly the RTs are influenced by word characteristics. In a recent article, Brysbaert et al. (2019) evaluated the contributions of a set of word characteristics to ELP zRTs.Footnote 6 These were:

  • Word frequency (SUBTLEX-US; Brysbaert & New, 2009)

  • Word length (in letters)

  • Word length (in syllables)

  • Number of morphemes (from Balota et al., 2007)

  • Orthographic distance to other words (OLD from Balota et al., 2007)

  • Phonological distance to other words (PLD from Balota et al., 2007)

  • Age of acquisition (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012)

  • Concreteness (Brysbaert, Warriner, & Kuperman, 2014)

Table 4 compares the regression analysis for the words in common between ELP and ECP (N = 18,305; the words dropped from the analyses in Table 3 were words for which we did not have information on all variables and words not recognized by 75% of the ELP participants, the criterion used by Brysbaert et al., 2019). For ease of comparison, the regression weights are expressed as beta coefficients, meaning that the dependent and independent variables were standardized. Figures 2 and 3 give a graphical display of the effects.
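The beta coefficients reported in Table 4 can be obtained by z-scoring the dependent and independent variables before fitting an ordinary least-squares regression. A minimal linear sketch with statsmodels is given below; the variable names are hypothetical, and the sketch ignores the nonlinear trends visible in Figs. 2 and 3, which require a more flexible model.

    import pandas as pd
    import statsmodels.api as sm

    PREDICTORS = ["zipf_frequency", "length_letters", "n_syllables", "n_morphemes",
                  "old20", "pld20", "aoa", "concreteness"]

    def standardized_regression(items: pd.DataFrame, dependent: str):
        """OLS on z-scored variables; the resulting weights are beta coefficients."""
        data = items[[dependent] + PREDICTORS].dropna()
        z = (data - data.mean()) / data.std()
        fit = sm.OLS(z[dependent], sm.add_constant(z[PREDICTORS])).fit()
        return fit.params, fit.rsquared

    # e.g.: betas_ecp, r2_ecp = standardized_regression(items, "ecp_rt")
    #       betas_elp, r2_elp = standardized_regression(items, "elp_zrt")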

Table 4 Outcomes of regressions on the ELP lexical decision time zRTs (ELPzLDT) and the ECP RTs for the words in common (N = 18,305)
Fig. 2. Effects of the variables on the standardized ELP lexical decision times. The first row shows the effects of word frequency and length in letters; the second shows those of number of syllables and number of morphemes; the third shows those of orthographic and phonological similarity to other words; and the last row shows the effects of age of acquisition and concreteness

Fig. 3. Effects of the variables on the ECP word recognition times. The first row shows the effects of word frequency and length in letters; the second shows those of number of syllables and number of morphemes; the third shows those of orthographic and phonological similarity to other words; and the last row shows the effects of age of acquisition and concreteness

As can be seen in Table 4 and Figs. 2 and 3, the effects of the word variables were quite comparable in the lexical decision parts of ELP and ECP. High-frequency words were responded to faster than low-frequency words, except for the very-high-frequency words, which are mostly function words (auxiliaries, conjunctions, determiners, particles, prepositions, or pronouns). Participants do not seem to expect function words in lexical decision experiments or vocabulary tests, possibly because these words are rarely seen in isolation, or because of list context effects, since the vast majority of stimuli presented in lexical decision tasks are content words. Indeed, the processing cost for these words is not seen in eye movement studies (Dirix, Brysbaert, & Duyck, 2018).

Words of six to eight letters were responded to faster than longer and shorter words; the effect was very much the same in ECP and ELP. Words with extra syllables were responded to more slowly, and morphologically complex words were responded to more rapidly than expected on the basis of the other variables. These effects were stronger in ELP than in ECP. The similarity to other words also tended to have a stronger effect in ELP than in ECP. Here we see the only contradiction between ELP and ECP: Whereas orthographic distance to other words hindered processing in ELP, it facilitated processing in ECP. Finally, the effects of age of acquisition (AoA) and concreteness were larger in ECP than in ELP.

All in all, the variables related to the activation of representations in the mental lexicon (frequency, AoA, concreteness) were stronger in ECP than in ELP. In contrast, the variables related to the similarity with other words (morphology, orthographic, and phonological similarity) tended to weigh more heavily in the speeded responses of ELP than in the unspeeded responses of ECP. Interestingly, words were responded to more slowly in ECP when they were orthographically similar to other words, whereas the reverse effect was observed in ELP. The ECP finding is in line with the hypothesis that it is more difficult to recognize a word when it resembles many other words. The ELP finding is in line with the proposal that speeded responses in a lexical decision task are not always based on individual word recognition, but can be based on the total degree of orthographic activation caused by the letter string (Grainger & Jacobs, 1996; Pollatsek, Perea, & Binder, 1999).

The regression accounted for 68% of the variance in the ELP zRTs and 60% of the variance in the ECP RTs. The correlation between ELP and ECP was .79 for this dataset. This is the same as for all words in common (Table 3), and, taking .79 as the estimate of the systematic variance, it means that we are still missing some 11%–19% of the systematic variance in the datasets (.79 − .68 = .11 for ELP; .79 − .60 = .19 for ECP).

Virtual experiments

A final way to probe the value of ECP is to see whether we can replicate some classic studies with the dataset. This is done by extracting the RTs from ECP for the stimuli used in the original experiments and running analyses over items. Keuleers et al. (2012) ran a number of such virtual experiments with BLP. The first question they addressed was whether the word frequency effect could be replicated. Given that ECP has a stronger frequency effect than ELP, we would expect this to be the case. Table 5 shows the outcome. To ease the comparison, the ELP and BLP data are given as average RTs and not as zRTs.
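In code, a virtual experiment reduces to looking up the ECP item means for the two stimulus lists of the original study and comparing them over items. A minimal sketch follows; the stimulus lists and column names are hypothetical.

    from typing import Sequence
    import pandas as pd
    from scipy import stats

    def virtual_experiment(word_level: pd.DataFrame,
                           condition_a: Sequence[str],
                           condition_b: Sequence[str]):
        """Item analysis: compare mean ECP RTs for two stimulus lists."""
        rts = word_level.set_index("word")["mean_rt"]
        a = rts.reindex(condition_a).dropna()   # stimuli missing from ECP are dropped
        b = rts.reindex(condition_b).dropna()
        t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch t test over items
        return a.mean(), b.mean(), t, p

    # e.g.: virtual_experiment(word_level, high_frequency_words, low_frequency_words)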

Table 5 Virtual experiments on the frequency effect (if needed, British spellings were replaced with American spellings)

The next variable Keuleers et al. (2012) investigated was AoA. Given that the AoA effect was stronger in ECP than in ELP, we again expected to replicate the findings. Table 6 shows the results. We were indeed able to replicate the published patterns. In particular for Gerhand and Barry (1999), the ECP virtual experiment was closer to the original experiment than the ELP and BLP virtual experiments were, partly because several observations were missing for the hardest condition in ELP and BLP.

Table 6 Virtual experiments on the age-of-acquisition (AoA) effect (if needed, British spellings were replaced with American spellings)

Another topic Keuleers et al. (2012) addressed was orthographic neighborhood size. The first computational models suggested that words with many neighbors should take longer to process, because there is more competition between activated word forms. A series of lexical decision experiments pointed to facilitation, however, which Grainger and Jacobs (1996) explained by assuming that lexical decision responses can be based on the total activation in the mental lexicon. Words with many neighbors initially create more activation in the lexicon than words with few neighbors and this would lead to a “word” response before the target word is fully recognized.

Given that the OLD effect in ECP was opposite to the one observed in ELP, it is interesting to see what virtual experiments show for this variable. Table 7 shows the results for some classic studies. Remember that these all involved monosyllabic words, a very small subset of the words in ECP. Although the results of the virtual experiments are largely in line with those of the original studies (including those of ECP), Table 7 is primarily a testimony to the weaknesses of the factorial design, as listed in the introduction. Most studies had too few stimuli to find anything significant in an analysis over stimuli, meaning that the differences could have been due to one or two stimuli in one or the other condition. Overall, however, it looks like the effects of neighborhood size are facilitatory in lexical decision (in particular, the number of body neighbors), and that inhibitory effects are largely due to the presence of a neighbor with a higher frequency (see also Chen & Mirman, 2012). In addition, neighbors are not limited to words of the same length, but include words with one letter omitted or added (Davis & Taft, 2005), as captured by the OLD and PLD measures. More importantly for the present discussion, the ECP findings are well in line with the other data for the monosyllabic words.

Table 7 Virtual experiments on orthographic neighborhood effects (if needed, British spellings were replaced with American spellings)

In a series of articles, Yates and colleagues argued that, in particular, phonological neighbors speed up lexical decisions (Yates, 2005, 2009; Yates, Locker, & Simpson, 2004). Table 8 looks at how well these findings replicate in ECP, ELP, and BLP. The basic finding of Yates et al. was replicated successfully with the stimuli selected by the authors, but the difference between two and three phonological neighbors (Yates, 2009) was less consistent. This agrees with Davis’s (2010) argument that the main neighborhood size effect is between no neighbors and one neighbor (with higher frequency).

Table 8 Effects of phonological neighborhood size in published lexical decision experiments and in virtual experiments with the same stimuli

Another effect worth looking at is the influence of word ambiguity. Rodd, Gaskell, and Marslen-Wilson (2002) argued that ambiguity has two opposite effects: Words with unrelated meanings (e.g., can, second) have longer lexical decision times than unambiguous control words, whereas words with related senses (uniform, burn) are responded to more rapidly than unambiguous control words. Table 9 shows that the facilitatory effect of multiple senses tends to be stronger than the inhibitory effect of multiple unrelated meanings, and that the effects seem to be clearer in ELP than in ECP, at least for the stimuli selected by Rodd et al.

Table 9 Effects of word ambiguity in published lexical decision experiments and in virtual experiments with the same stimuli

A final finding in lexical decision research we will look at is the size effect. Sereno, O’Donnell, and Sereno (2009) reported that participants respond faster to words representing big things (bed, truck, buffalo) than to matched words representing small things (cup, thumb, apricot). The authors related this finding to the importance of embodied cognition, a view according to which cognitive processing involves internal simulations of perceptual and motor processes (Barsalou, 2008; Fischer & Zwaan, 2008). Kang, Yap, Tse, and Kurby (2011), however, were unable to replicate the finding and, in addition, reported that the effect was absent in ELP. Table 10 gives the outcome of a virtual experiment in ECP, in addition to ELP. BLP could not be used, because nearly half of the stimulus materials were longer than two syllables. As can be seen in Table 10, the size effect was not replicated in ECP, either.

Table 10 Effects of concept size in Sereno et al. (2009) and in virtual experiments with the same stimuli

Education differences

Up to now we have discussed findings that ECP has in common with ELP and BLP, and seen that for these words ECP is a valid addition to the existing megastudies. However, the merit of ECP goes further. For a start, ECP offers data for 35 thousand words not covered by ELP, and for 50 thousand words not present in BLP. This substantially increases the resources available to researchers.

In addition, ECP includes a much wider range of participants than the typical undergraduate students. Some participants had only finished high school; others had achieved a bachelor's degree (often outside university), a master's degree (at university), or a PhD. On average, we had 170 observations per word for participants who had finished high school, 296 for participants with a bachelor's degree, 125 for participants with a master's degree, and 46 for participants with a PhD. Because of the small numbers in the last group, we limit the analysis to the first three groups.

Keuleers et al. (2015) and Brysbaert et al. (2016a) already discussed the number of words known as a function of education level. Participants with more education know more words than participants with less education. Interestingly, the differences were modest when the participants’ age was taken into account and mainly originated during the study years, arguably because the participants then were acquiring the academic vocabulary related to their studies and word use in higher education (Coxhead, 2000).

To compare the three education groups, we report the outcomes of the regression analysis with the data discussed in Table 4 (N = 18,305). Two outcomes are given: first the analysis with the unstandardized regression weights, and then the analysis with the beta coefficients. The former tells us how the RTs differ between groups; the latter, how the relative importance of the variables varies. Variables were centered, so that the intercept gives us the RT of the "middle" word. Interestingly, the ELP zRTs correlate highest with the participants who finished high school (r = .79), followed by those with a bachelor's degree (r = .77), and lowest with the participants with a master's degree (r = .71). This is in line with the fact that most ELP participants were undergraduate students. On the other hand, the lower correlation with the master's degree group is probably also to some extent due to the lower number of observations for this group (resulting in a lower reliability of the ECP RTs).
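A sketch of the per-group analysis with unstandardized weights is given below: the predictors are centered so that the intercept corresponds to the predicted RT of a word with average values on all variables. Column and group names are hypothetical.

    import pandas as pd
    import statsmodels.api as sm

    PREDICTORS = ["zipf_frequency", "length_letters", "n_syllables", "n_morphemes",
                  "old20", "pld20", "aoa", "concreteness"]

    def centered_regression(items: pd.DataFrame, rt_column: str):
        """Unstandardized regression; centered predictors make the intercept the
        predicted RT of the 'middle' word."""
        data = items[[rt_column] + PREDICTORS].dropna()
        centered = data[PREDICTORS] - data[PREDICTORS].mean()
        return sm.OLS(data[rt_column], sm.add_constant(centered)).fit()

    # One fit per education group, assuming per-group RT columns in the item table:
    # fits = {g: centered_regression(items, "rt_" + g)
    #         for g in ("highschool", "bachelor", "master")}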

Table 11 shows the outcomes of the analyses. Participants with less education responded more slowly, as can be seen in the intercepts, and tended to show stronger effects of frequency, AoA, and number of syllables. Participants with a master's degree seem to be more willing to respond on the basis of total orthographic activation, given that the effect of OLD is stronger for them. Overall, however, the differences are small and do not seem to offset the smaller number of observations per word. In particular, R2 for the participants with a master's degree is considerably lower (R2 = .48).

Table 11 Outcomes of regression analyses for the three education groups of ECP (high school, bachelor, master) for the words in common with ELP (N = 18,305)

Figure 4 shows how the predicted RTs differ for the three education groups as a function of word frequency. This illustrates that the effect of education is particularly strong for low-frequency words.

Fig. 4. Predicted response times for the three education groups as a function of word frequency. The regressions included all variables mentioned in Table 11

Age differences

Another variable we can look at is the age group of the participants (Wulff et al., 2019). Davies, Arnell, Birchenough, Grimmond, and Houlson (2017) reported that the effects of word frequency and AoA on lexical decision times become smaller with increasing age over adult life. At the same time, there was ageing-related response slowing, which could be attributed to decreasing efficiency of stimulus encoding and/or response execution in older age, but is also consistent with increased processing costs related to the accumulation of information learned over time (Ramscar, Hendrix, Shaoul, Milin, & Baayen, 2014).

The decrease of the word frequency effect in older participants is expected on the basis of their longer exposure to the language. A number of publications indicate that the word frequency effect becomes smaller as participants are exposed to more language (Brysbaert, Lagrou, & Stevens, 2017; Brysbaert et al., 2018; Cop, Keuleers, Drieghe, & Duyck, 2015; Diependaele, Lemhöfer, & Brysbaert, 2013; Mainz, Shao, Brysbaert, & Meyer, 2017; Mandera, 2016, chap. 4; Monaghan, Chang, Welbourne, & Brysbaert, 2017). This is expected on the basis of two types of models. First, connectionist models at a certain point show a decrease in the frequency effect, when overlearning takes place (Monaghan et al., 2017). Second, Mandera (2016, chap. 4) showed that a decrease in the frequency effect as a function of practice is predicted if word learning follows a power law rather than an exponential law (Logan, 1988).

At the same time, exposure to language increases the vocabulary of a person. Healthy older participants indeed have a larger vocabulary than young adults (Verhaeghen, 2003), and vocabulary has been shown to grow logarithmically with age (Keuleers et al., 2015). Particularly relevant to the ECP stimulus set, Brysbaert et al. (2016a) reported that a 60-year-old person on average knows 6,000 lemmas more than a 20-year-old person, an increase of some three words per week.

In contrast to Davies et al. (2017) and our own work, Cohen-Shikora and Balota (2016) failed to find a decrease in the word frequency effect as a function of age. They administered three tasks (lexical decision, word naming, and animacy judgment) to 148 participants, ranging in age from 18 to 86 years. Each task consisted of responses to 400 words (in counterbalanced order). Only in word-naming latencies was there a hint of a smaller word frequency effect in older participants than in younger participants. At the same time, the data of Cohen-Shikora and Balota (2016) replicated the core effects of the other studies: (1) Older participants were slower and more accurate than younger participants, (2) older participants had a larger vocabulary than younger participants, and (3) there was a negative correlation between vocabulary size and the word frequency effect. The analyses of Cohen-Shikora and Balota were done on the z scores of RTs. Could this have made a difference, as z scores not only eliminate differences in means but also equalize the standard deviations?

To compare age groups, we made a distinction between participants aged 18–23 (on average, 104 observations per word), 24–29 (117 observations), 30–39 (150 observations), 40–49 (106 observations), and 50+ years (124 observations). To see whether our age differences were in line with those of Spieler and Balota (1997; young participants) and Balota and Spieler (1998; old participants), we looked at the correlations with these datasets. For the young participants of Spieler and Balota, the correlations with increasing age group were .60, .59, .58, .52, and .50. For the old participants of Balota and Spieler, the correlations were, respectively, .51, .50, .53, .48, and .49. The pattern of results was as expected for the young participants, but not for the old participants. One reason may be that the old participants of Balota and Spieler had a mean age of 74 years, substantially older than the ECP participants. Another likely contributor is differences in the reliability of the word-processing estimates in the various age groups.

Table 12 and the left panel of Fig. 5 show the results of the regression analyses. They are in line with the observation of Davies et al. (2017) that the frequency and AoA effects decrease over age. The OLD and PLD effects also seem to become smaller, in line with the observation that the older participants took somewhat more time to respond. Finally, it looks like the effects of number of syllables and concreteness increase as adults grow older.

Table 12 Outcomes of regression analyses for the five age groups in ECP for the words in common with ELP (N = 18,305)
Fig. 5. Predicted response times for the five age groups as a function of word frequency: (Left) Raw RTs. (Right) zRTs. The regressions included all variables mentioned in Table 12.

The left panel of Fig. 5 shows the predicted RTs for the five age groups as a function of word frequency. These point to longer response times for older participants. At the same time, because the cost for low-frequency words is smaller for older participants, the age differences in RTs are smallest for the low-frequency words.

To make sure that our results did not rely on the use of raw RTs as the dependent variable, we also analyzed the standardized RTs. As can be seen in the right panel of Fig. 5, the findings remained the same. Because zRTs eliminate differences in average RTs, they more clearly illustrate the smaller frequency effect in older than in younger participants.

One reason for the difference in findings between Fig. 5 and Cohen-Shikora and Balota (2016) could be that Cohen-Shikora and Balota were very careful to equate their groups on education level. It is possible that in our study we had relatively more educated older people than younger people (e.g., because they have more access to the internet).Footnote 7 To test this possibility, we compared the age group of 24- to 29-year-olds with the age group of 50- to 59-year-olds for the participants with high school, bachelor's, and master's education, once with the data analyzed as in Fig. 5 and once with equal weights given to the three education levels (a sketch of this weighting step is given below). The age group 24–29 was chosen because it is the youngest group for which master's-level education is possible; the group 50–59 was chosen because it is the oldest group with comparable homogeneity. To optimize the comparison with Cohen-Shikora and Balota, we used zRTs as the dependent variable.
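The equal-weighting step can be sketched as follows (column names hypothetical): instead of pooling all trials of an age group, which weights each education level by its share of the data, the word means are first computed per education level and then averaged, so that each level contributes equally.

    import pandas as pd

    def word_means_equal_weights(trials: pd.DataFrame) -> pd.Series:
        """Per-word mean zRT with the three education levels weighted equally."""
        per_level = trials.groupby(["word", "education"])["zrt"].mean()
        return per_level.groupby(level="word").mean()   # unweighted mean over education levels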

Figure 6 shows the outcome. There is no evidence that the smaller frequency effect in the 50–59 group is due to differences in education level (something we did not see in the distribution of education levels in the two age groups, either). So, our data agree more with those of Davies et al. (2017) than with those of Cohen-Shikora and Balota (2016). A further challenge for the interpretation of Cohen-Shikora and Balota is how to square the absence of a correlation between age and the frequency effect with the presence of significant correlations between vocabulary size and the frequency effect, on the one hand, and between vocabulary size and age, on the other. In defense of Cohen-Shikora and Balota, their study is the only one to have included several word-processing tasks, a large group of participants above 60 years of age, and extensive attempts to match the groups of participants. On the negative side, they included words in a more restricted frequency range than we did (Zipf scores going from roughly 2.0 to 5.2, with a mean of 3.6). This might have made it more difficult to observe the interaction.

Fig. 6. Comparison of frequency effects in the age groups of 24–29 and 50–59 years, when not controlled for possible differences in education level (left) and when controlled for such differences (right). Effects are calculated on zRTs. The regressions included all variables mentioned in Table 12.

Conclusions

We present a new word dataset, the English Crowdsourcing Project, which is larger than all currently available datasets (Table 1), both in the number of words included and in the number and variety of participants taking part.

The dataset was collected by means of an internet vocabulary test, in which participants indicated which words they knew and which they did not. To discourage "yes" responses to unknown words, about one third of the stimuli were nonwords, and participants were penalized if they said "yes" to these nonwords.

Although speed of responding was not mentioned as an evaluation criterion to the participants, the present analyses show that the RTs correlate well with the lexical decision times collected in laboratory settings; they are just some 250 ms longer. Surprisingly, the longer RTs did not lead to larger effects in the virtual experiments. For all the experiments, the effects in ECP were comparable to the original effects and to those in the English Lexicon Project and the British Lexicon Project. This was unexpected, because often longer RTs are accompanied by larger differences between conditions (see, e.g., Table 8 of Keuleers et al., 2012). It suggests that the extra time in ECP was largely unrelated to word recognition and the decision processes (for a model including such a time delay, see Ratcliff, Gomez, & McKoon, 2004). Apparently, participants took extra time to perceive the stimulus and give a response. In this respect, it is important to mention that the RTs in a lexical decision task drop by some 100 ms in the first few hundred trials (Keuleers, Diependaele, & Brysbaert, 2010; Keuleers et al., 2012). Given that most participants completed only 100 trials in ECP, this could explain some of the extra time taken to respond. Another contributor might have been the software used to present the stimuli and collect the responses via the internet.

To some extent, it is surprising that untimed answers to a vocabulary test would resemble lexical decision times so well, when based on large numbers of observations. This testifies to the ecological validity of the lexical decision task, in that very much the same results were obtained in an untimed vocabulary test outside of academia as on a speeded response task in the laboratory.

ECP is further interesting because a large range of people took part. Surprisingly, we found no large differences between education levels (Fig. 4). Presumably this is because only people interested in language and with easy access to the internet took part in the test. There is evidence that the size of the frequency effect depends more on the amount of reading and language exposure than on the intelligence or education level of the participants (Brysbaert, Lagrou, & Stevens, 2017). ECP does point to some interesting effects of age (or language exposure), however. The effects of frequency and age of acquisition seem to become smaller as adults grow older (see also Davies et al., 2017), whereas older people seem to be more affected by the meanings of words (as indicated by the concreteness effect) and by the complexity of a word (the number of syllables). Further targeted experiments will have to confirm these initial impressions. Such experiments could also try to include an even wider variety of participants.

Availability

The raw data and Excel files containing the most important information can be found at the Open Science Framework webpage https://osf.io/rpx87/ or on our website http://crr.ugent.be/. To facilitate analyses of the full dataset, we have released a Python module for working with the raw data (available at https://github.com/pmandera/vocab-crowd).

The Excel files are included for a broader audience, as their use does not require programming skills. First, we have the master file containing the information calculated across all participants, called English Crowdsourcing Project All Native Speakers. Its outline is shown in Fig. 7.

Fig. 7. Outline of the ECP master file, including response times (RTs) based on all native speakers

Column A gives the word. Column B says how many observations there were for that word. Column C gives the response accuracy; because the RTs were calculated over correct responses only, the accuracy (multiplied by the number of observations in column B) indicates the number of observations on which the RTs are based. We would prefer that users not use the information in column C for anything other than the analysis of RTs: in Brysbaert et al. (2019), we present the word prevalence measure, which is a better index of word knowledge than accuracy (even though it correlates .96 with the accuracies reported here). Word prevalence is given in column D. Columns E–H contain the new information: the ECP RTs and their standard deviations across participants, plus the zRTs and their standard deviations. Finally, for the user's convenience, column I includes the SUBTLEX-US frequencies, expressed as Zipf values (Brysbaert et al., 2018).
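A minimal sketch for reading the master file with pandas is given below; the exact file name on OSF and the column headers are assumptions and should be checked against Fig. 7 before use.

    import pandas as pd

    ecp = pd.read_excel("English Crowdsourcing Project All Native Speakers.xlsx")
    # Assumed column labels, in the A-I order described above:
    ecp.columns = ["word", "n_observations", "accuracy", "prevalence",
                   "rt", "rt_sd", "zrt", "zrt_sd", "zipf_frequency"]
    known = ecp[ecp["accuracy"] >= .85]   # e.g., restrict to generally known words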

In addition to the master file, we also have files with the data split per education level (ECP education groups), per age (ECP age groups), and per Age × Education Level. Users who want other summary files are invited to make them themselves on the basis of the raw data.