SUBTLEX-PL: subtitle-based word frequency estimates for Polish

@article{Mandera2015SubtlexplSW,
  title={{SUBTLEX-PL}: subtitle-based word frequency estimates for {P}olish},
  author={Paweł Mandera and Emmanuel Keuleers and Zofia Wodniecka and Marc Brysbaert},
  journal={Behavior Research Methods},
  year={2015},
  volume={47},
  pages={471--483}
}
We present SUBTLEX-PL, Polish word frequencies based on movie subtitles. In two lexical decision experiments, we compare the new measures with frequency estimates derived from another Polish text corpus that includes predominantly written materials. We show that the frequencies derived from the two corpora perform best in predicting human performance in a lexical decision task if used in a complementary way. Our results suggest that the two corpora may have unequal potential for explaining…
SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan
TLDR
Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates and were better predictors than the ones obtained from films and fiction TV series alone.
Shabd: A psycholinguistic database for Hindi.
TLDR
The Shabd database, a psycholinguistic database for Hindi based on a corpus of 1.4 billion words from electronic newspapers and news websites, is presented, and it is observed that word frequency accounts for as much variance as contextual diversity (operationalized as the number of documents in which the words were observed).
Word frequency counts: Linking corpus data to user’s perception in linguistic research
Lexical frequency is one of the major variables involved in language processing. It constitutes a cornerstone of psycholinguistic, corpus linguistic as well as applied research. Linguists…
On the predictive validity of various corpus-based frequency norms in L2 English lexical processing
TLDR
A comparison of the predictive power of a large set of corpus-based frequency norms for performance on an L2 English visual lexical decision task (LDT) showed that the frequency norms from SUBTLEX-US and WorldLex–Blog tended to predict L2 reaction times better, whereas the frequency norms from corpora with a mixture of written and spoken genres tended to predict L2 accuracy better.
Database of word-level statistics for Mandarin Chinese (DoWLS-MAN).
TLDR
The Database of Word-Level Statistics for Mandarin Chinese (DoWLS-MAN) is presented, and it is illustrated how multiple schematic representations of the phonological mental lexicon can aid in hypothesis generation, specifically in terms of phonological processing when reading Chinese orthography.
DURATIONAL VARIATION IN POLISH FRICATIVES PROVIDES EVIDENCE FOR HYBRID MODELS OF PHONOLOGY
The neighborhood density of a word is the number of words that sound similar to it. Phonotactic probability is a measure of how typical (for a given language) the phoneme sequences in a word are.
The influence of place and time on lexical behavior: A distributional analysis.
TLDR
A corpus of over 26,000 fiction books was used to show that computational models of language trained on samples of language representative of the language located in a particular place and time can track differences in people's experimental language behavior, and to validate a new machine-learning approach for optimizing language models.
Measuring phonological distance between languages
Three independent approaches to measuring cross-language phonological distance are pursued in this thesis: exploiting phonological typological parameters; measuring the cross-entropy of phonologically…
A plea for more interactions between psycholinguistics and natural language processing research
TLDR
A crowdsourcing study in the Dutch language has resulted in information about how well 52,000 lemmas are known, which is likely to be of interest to NLP researchers and computational linguists.
Social Media and Language Processing: How Facebook and Twitter Provide the Best Frequency Estimates for Studying Word Recognition
TLDR
Frequencies computed from social media are currently the best frequency‐based estimators of lexical decision reaction times (up to a 3.6% increase in explained variance), and the gains remain robust and substantial when the authors control for corpus size.

References

Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek
TLDR
Examination of SUBTLEX-GR, a subtitle-based corpus consisting of more than 27 million Modern Greek words, showed that frequencies estimated from a subtitle corpus explained the obtained results significantly better than traditional frequencies derived from written corpora.
SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles
TLDR
This database of word and character frequencies based on a corpus of film and television subtitles is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used.
The word frequency effect: a review of recent developments and implications for the choice of frequency estimates in German.
TLDR
It is found that the commonly used Celex frequencies are the least powerful in predicting lexical decision times in the German language.
SUBTLEX-ESP: Spanish word frequencies based on film subtitles
TLDR
This study presents a subtitle-based word frequency list for Spanish, one of the most widely spoken languages, and finds that the subtitle frequencies explained 6% more of the variance than the existing written frequencies in lexical decision, and 2% extra in word naming.
Subtlex-UK: A New and Improved Word Frequency Database for British English
TLDR
A new measure of word frequency, the Zipf scale, is introduced, which the authors hope will stop the current misunderstandings of the word frequency effect.
The use of film subtitles to estimate word frequencies
We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way…
Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English
TLDR
The size of the corpus, the language register on which the corpus is based, and the definition of the frequency measure were investigated, finding that lemma frequencies are not superior to word form frequencies in English and that a measure of contextual diversity is better than a measure based on raw frequency of occurrence.
More than words: Frequency effects for multi-word phrases
There is mounting evidence that language users are sensitive to distributional information at many grain-sizes. Much of this research has focused on the distributional properties of words,…
Contextual Diversity, Not Word Frequency, Determines Word-Naming and Lexical Decision Times
TLDR
It is argued that the results reflect the importance of likely need in memory processes, and that the continuity between reading and memory suggests using principles from memory research to inform theories of reading.
Practice Effects in Large-Scale Visual Word Recognition Studies: A Lexical Decision Study on 14,000 Dutch Mono- and Disyllabic Words and Nonwords
TLDR
The results show that when good nonwords are used, practice effects are minimal in lexical decision experiments and do not invalidate the behavioral data, which means that large-scale word recognition studies can make use of psychophysical and psychometrical approaches.