Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment

Emmanuel Keuleers, Michaël A. Stevens, Paweł Mandera, Marc Brysbaert
Quarterly Journal of Experimental Psychology, pp. 1665-1692
We use the results of a large online experiment on word knowledge in Dutch to investigate variables influencing vocabulary size in a large population and to examine the effect of word prevalence—the percentage of a population knowing a word—as a measure of word occurrence. Nearly 300,000 participants were presented with about 70 word stimuli (selected from a list of 53,000 words) in an adapted lexical decision task. We identify age, education, and multilingualism as the most important factors… 
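As an illustration of the definition given in the abstract (word prevalence as the percentage of a population knowing a word), the following minimal sketch computes a raw percentage from yes/no responses. The function name, the tuple layout, and the sample data are assumptions made for this example; it is not the authors' estimation procedure, which may involve additional corrections not described here.

```python
from collections import defaultdict


def word_prevalence(responses):
    """Return the percentage of respondents who indicated knowing each word.

    `responses` is an iterable of (participant_id, word, knows_word) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    seen = defaultdict(int)   # number of participants shown each word
    known = defaultdict(int)  # number of participants who said they know it
    for _participant, word, knows_word in responses:
        seen[word] += 1
        if knows_word:
            known[word] += 1
    # Raw percentage per word; no weighting or guessing correction is applied.
    return {word: 100.0 * known[word] / seen[word] for word in seen}


# Hypothetical responses from two participants to two placeholder words.
sample = [
    ("p1", "word_a", True),
    ("p2", "word_a", True),
    ("p1", "word_b", False),
    ("p2", "word_b", True),
]
prevalence = word_prevalence(sample)
```

Because each participant sees only about 70 of the 53,000 words, the denominator here is the number of participants who were shown a given word, not the full sample.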
Is Frequency Enough?: The Frequency Model in Vocabulary Size Testing
ABSTRACT Modern vocabulary size tests are generally based on the notion that the more frequent a word is in a language, the more likely a learner will know that word. However, this assumption has
How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age
Based on an analysis of the literature and a large scale crowdsourcing experiment, we estimate that an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200
How do Spanish speakers read words? Insights from a crowdsourced lexical decision megastudy
Results from a crowdsourced lexical decision megastudy in which more than 150,000 native speakers from around 20 Spanish-speaking countries performed a lexical decision task to 70 target word items selected from a list of about 45,000 Spanish words highlight the value of crowdsourced approaches to uncover effects that are traditionally masked by small-sampled in-lab factorial experimental designs.
Vocabulary Knowledge Predicts Lexical Processing: Evidence from a Group of Participants with Diverse Educational Backgrounds
This study examined the relationship between individual differences in vocabulary and language processing performance more closely by using a battery of vocabulary tests instead of just one test, and testing not only university students but young adults from a broader range of educational backgrounds.
Vocabulary size seems to be affected by multiple factors, including those that belong to the properties of the words themselves and those that relate to the characteristics of the
Word prevalence norms for 62,000 English lemmas
Word prevalence predicts word processing times, over and above the effects of word frequency, word length, similarity to other words, and age of acquisition, in line with previous findings in the Dutch language.
Predicting Word Learning Order in Dutch and English Using Different Word Frequencies and Other Word Attributes
In this study Age of Acquisition (AoA) of words is predicted using different word frequency distributions and other word attributes, like word length, word concreteness, lexical decision times, word
Estimating the prevalence and diversity of words in written language
This metric is derived from an analysis of a newly collected corpus of over 25,000 fiction and non-fiction books and is shown to account for significantly more variance than past corpus-based measures.
Orthographic Knowledge and Lexical Form Influence Vocabulary Learning.
Investigating the role of sublexical native-language patterns on novel word acquisition suggests that language learners benefit from both native-language overlap and regularities within the novel language.


Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English
The size of the corpus, the language register on which the corpus is based, and the definition of the frequency measure were investigated, finding that lemma frequencies are not superior to word form frequencies in English and that a measure of contextual diversity is better than a measure based on raw frequency of occurrence.
Contextual Diversity, Not Word Frequency, Determines Word-Naming and Lexical Decision Times
It is argued that the results reflect the importance of likely need in memory processes, and that the continuity between reading and memory suggests using principles from memory research to inform theories of reading.
Reassessing word frequency as a determinant of word recognition for skilled and unskilled readers.
It is demonstrated via computational simulations and norming studies that corpus-based word frequencies systematically overestimate strengths of word representations, especially in the low-frequency range and in smaller-size vocabularies, challenging the view that the more skilled an individual is in generic mechanisms of word processing, the less reliant he or she will be on the actual lexical characteristics of that word.
Practice Effects in Large-Scale Visual Word Recognition Studies: A Lexical Decision Study on 14,000 Dutch Mono- and Disyllabic Words and Nonwords
The results show that when good nonwords are used, practice effects are minimal in lexical decision experiments and do not invalidate the behavioral data, which means that large-scale word recognition studies can make use of psychophysical and psychometrical approaches.
SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles
A new database of Dutch word frequencies based on film and television subtitles is presented, and an accessibility measure based on contextual diversity explains more of the variance in accuracy and RT than does the raw frequency of occurrence counts.
Subtlex-UK: A New and Improved Word Frequency Database for British English
A new measure of word frequency, the Zipf scale, is introduced, which the authors hope will stop the current misunderstandings of the word frequency effect.
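For readers unfamiliar with the Zipf scale, it is commonly described as the base-10 logarithm of a word's frequency per million words plus 3 (equivalently, the log of its frequency per billion words). The sketch below assumes that simple form; the function and parameter names are made up for this example, and published norms may additionally apply smoothing so that unseen words receive a finite value.

```python
import math


def zipf_value(count, corpus_tokens):
    """Convert a raw corpus count to a Zipf value.

    Assumes the plain definition log10(frequency per million words) + 3,
    with no smoothing. `count` must be positive.
    """
    if count <= 0:
        raise ValueError("count must be positive; smoothing is not handled here")
    per_million = count / (corpus_tokens / 1_000_000)
    return math.log10(per_million) + 3


# Hypothetical counts in a 1-million-token corpus.
rare = zipf_value(1, 1_000_000)       # one occurrence per million words
common = zipf_value(1000, 1_000_000)  # a thousand occurrences per million words
```

On this scale, a word occurring once per million words falls at 3, and each additional unit corresponds to a tenfold increase in frequency.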
Visual word recognition of single-syllable words.
Large-scale regression studies were used to investigate the unique predictive variance of phonological features in the onsets, lexical variables, and semantic variables in visual word recognition, shedding light on recent empirical controversies in the available word recognition literature.
Moving beyond Coltheart’s N: A new measure of orthographic similarity
It is demonstrated that OLD20 provides significant advantages over ON in predicting both lexical decision and pronunciation performance in three large data sets, and interacts more strongly with word frequency and shows stronger effects of neighborhood frequency than does ON.
Norms of valence, arousal, and dominance for 13,915 English lemmas
This work extended the ANEW database to nearly 14,000 English lemmas, providing researchers with a much richer source of information, including gender, age, and educational differences in emotion norms.