On the Advantages of Word Frequency and Contextual Diversity Measures Extracted from Subtitles: The Case of Portuguese

@article{Soares2015OnTA,
  title={On the Advantages of Word Frequency and Contextual Diversity Measures Extracted from Subtitles: The Case of Portuguese},
  author={Ana Paula Soares and Jo{\~a}o Machado and Ana Santos Costa and {\'A}lvaro Iriarte and Alberto Sim{\~o}es and Jos{\'e} Jo{\~a}o de Almeida and Montserrat Comesa{\~n}a and Manuel Perea},
  journal={Quarterly Journal of Experimental Psychology},
  year={2015},
  volume={68},
  pages={680--696}
}
We examined the potential advantage of lexical databases based on subtitles and present SUBTLEX-PT, a new lexical database for 132,710 Portuguese words obtained from a 78-million-word corpus of film and television series subtitles, offering word frequency and contextual diversity measures. Additionally, we validated SUBTLEX-PT with a lexical decision study involving 1920 Portuguese words (and 1920 nonwords) with different lengths in letters (M = 6.89, SD = 2.10) and syllables (M = 2.99, SD = 0…
SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan
TLDR
Two lexical decision experiments revealed that the subtitle-based metrics outperformed the previously available frequency estimates and were better predictors than the ones obtained from films and fiction TV series alone.
The Minho Word Pool: Norms for imageability, concreteness, and subjective frequency for 3,800 Portuguese words
TLDR
The Minho Word Pool (MWP) is a dataset that provides normative values of imageability, concreteness, and subjective frequency for 3,800 (European) Portuguese words: three subjective measures that, in spite of being used extensively in research, have been scarce for Portuguese.
On the predictive validity of various corpus-based frequency norms in L2 English lexical processing
TLDR
A comparison of the predictive power of a large set of corpus-based frequency norms for performance on an L2 English visual lexical decision task (LDT) showed that the frequency norms from SUBTLEX-US and WorldLex–Blog tended to predict L2 reaction times better, whereas the frequency norms from corpora with a mixture of written and spoken genres tended to predict L2 accuracy better.
Psycholinguistic variables in visual word recognition and pronunciation of European Portuguese words: a mega-study approach
ABSTRACT An increasing number of psycholinguistic studies have adopted a megastudy approach to explore the role that different variables play in the speed and/or accuracy with which words are…
The role of word frequency and contextual diversity in visual word recognition: a mini review
Contextual diversity refers to the number of contexts in which a word appears. It is traditionally believed that word frequency is an important factor affecting lexical access, but the presence of…
Effects of Character and Word Contextual Diversity in Chinese Beginning Readers
ABSTRACT While recent studies find that contextual diversity (CD) is a better determinant of visual word recognition than token frequency, there is a dearth of work comparing contextual diversity and…
Procura-PALavras (P-PAL): A Web-based interface for a new European Portuguese lexical database
TLDR
P-PAL is a Web-based interface for a new European Portuguese (EP) lexical database that provides a broad range of word attributes and statistics, and will be a key resource to support research in all cognitive areas that use EP verbal stimuli.
Norming studies for lexicosemantic and affective characteristics of European Portuguese words: A literature review
Words are widely used as stimuli in cognitive and linguistic research. As words may vary on various domains (e.g., lexicosemantic and affective), which can influence performance in many ways, it is…
Disentangling the effects of word frequency and contextual diversity on serial recall performance
TLDR
The first independent manipulation of CD and WF in a serial recall task suggests a more difficult episodic retrieval of item information for words of high CD, and a role for both item and order information in the WF effect.
The role of syllables in intermediate-depth stress-timed languages: masked priming evidence in European Portuguese
The role of syllables as a sublexical unit in visual word recognition and reading is well established in deep and shallow syllable-timed languages such as French and Spanish, respectively. However,…

References

SHOWING 1-10 OF 109 REFERENCES
SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles
TLDR
This database of word and character frequencies based on a corpus of film and television subtitles is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used.
SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles
TLDR
A new database of Dutch word frequencies based on film and television subtitles is presented, and an accessibility measure based on contextual diversity explains more of the variance in accuracy and RT than does the raw frequency of occurrence counts.
Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek
TLDR
Examination of SUBTLEX-GR, a subtitle-based corpus consisting of more than 27 million Modern Greek words, showed that frequencies estimated from a subtitle corpus explained the obtained results significantly better than traditional frequencies derived from written corpora.
SUBTLEX-ESP: Spanish word frequencies based on film subtitles
TLDR
This study presents a subtitle-based word frequency list for Spanish, one of the most widely spoken languages, and finds that the subtitle frequencies explained 6% more of the variance than the existing written frequencies in lexical decision, and 2% extra in word naming.
Contextual Diversity, Not Word Frequency, Determines Word-Naming and Lexical Decision Times
TLDR
It is argued that the results reflect the importance of likely need in memory processes, and that the continuity between reading and memory suggests using principles from memory research to inform theories of reading.
Assessing the Usefulness of Google Books’ Word Frequencies for Psycholinguistic Research on Word Processing
TLDR
It is found that, despite the massive corpus on which the Google estimates are based, the Google American English frequencies explain 11% less of the variance in the lexical decision times from the English Lexicon Project than the SUBTLEX-US word frequencies, based on a corpus of 51 million words from film and television subtitles.
Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English
TLDR
The size of the corpus, the language register on which the corpus is based, and the definition of the frequency measure were investigated. The results show that lemma frequencies are not superior to word form frequencies in English and that a measure of contextual diversity is better than a measure based on raw frequency of occurrence.
The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words
TLDR
The high correlation between the BLP and ELP data indicates that a high percentage of variance in lexical decision data sets is systematic variance, rather than noise, and that the results of megastudies are rather robust with respect to the selection and presentation of the stimuli.
The word frequency effect: a review of recent developments and implications for the choice of frequency estimates in German.
TLDR
It is found that the commonly used Celex frequencies are the least powerful predictors of lexical decision times in German.
EsPal: One-stop shopping for Spanish word properties
TLDR
EsPal is a Web-accessible repository containing a comprehensive set of properties of Spanish words, based on an extensible set of data sources, beginning with a 300 million token written database and a 460 million token subtitle database.