Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

@article{Turney2001MiningTW,
  title={Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL},
  author={Peter D. Turney},
  journal={ArXiv},
  year={2001},
  volume={cs.LG/0212033}
}
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of…
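As a rough illustration of the idea in the abstract, the sketch below picks the most likely synonym for a problem word by computing pointwise mutual information from co-occurrence counts. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `pmi`, `best_choice`, and `toy_hits` are hypothetical, the `hits` callable merely stands in for a search-engine query, and every count in `TOY_HITS` is invented for illustration.

```python
import math
from typing import Callable, Dict, FrozenSet, Sequence

Hits = Callable[[FrozenSet[str]], int]


def pmi(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Pointwise mutual information log2(p(a,b) / (p(a) * p(b))) from raw counts."""
    if min(joint, count_a, count_b) <= 0:
        # No co-occurrence evidence: treat the pair as maximally unrelated.
        return float("-inf")
    return math.log2((joint * total) / (count_a * count_b))


def best_choice(problem: str, choices: Sequence[str], hits: Hits, total: int) -> str:
    """Return the choice whose PMI with the problem word is highest."""
    def score(choice: str) -> float:
        return pmi(
            hits(frozenset({problem, choice})),  # documents containing both words
            hits(frozenset({problem})),          # documents containing the problem word
            hits(frozenset({choice})),           # documents containing the choice
            total,
        )
    return max(choices, key=score)


# Hypothetical document counts, invented for this example (not real search data).
TOTAL_DOCS = 1_000_000
TOY_HITS: Dict[FrozenSet[str], int] = {
    frozenset({"big"}): 1000,
    frozenset({"large"}): 2000,
    frozenset({"red"}): 4000,
    frozenset({"big", "large"}): 200,
    frozenset({"big", "red"}): 40,
}


def toy_hits(terms: FrozenSet[str]) -> int:
    """Stand-in for a search-engine hit count; unknown queries return 0."""
    return TOY_HITS.get(terms, 0)


if __name__ == "__main__":
    print(best_choice("big", ["large", "red"], toy_hits, TOTAL_DOCS))
```

Because the problem word is the same for every candidate, its count and the corpus size only shift all scores by a constant, so ranking the candidates by the ratio of joint hits to the candidate's own hits would give the same winner.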
Using Cooccurrence Statistics and the Web to Discover Synonyms in a Technical Language
TLDR
The results indicate that AVMI is very good at spotting synonym couples among pairs of unrelated terms and that it outperforms more standard methods based on contextual cosine similarity, but it is not able to distinguish between synonyms and other semantically related terms.
Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words
TLDR
A new corpus-based method, called Second Order Co-occurrence PMI (SOC-PMI), uses Pointwise Mutual Information to sort lists of important neighbor words of the two target words to calculate the relative semantic similarity.
Automatic LSA-Based Retrieval of Synonyms (for Search Space Extension)
This paper describes research, experiments, and theoretical considerations leading towards automatic computational thesaurus construction based upon identification of synonyms in large sets of…
Self-Supervised Synonym Extraction from the Web
TLDR
This paper presents a synonym extraction framework based on self-supervised learning that models the extraction of synonyms from sentences as a sequential labeling problem and automatically generates labeled training samples by using structured knowledge from online encyclopedias and some generic heuristic rules.
Modeling Information Scent: A Comparison of LSA, PMI and GLSA Similarity Measures on Common Tests and Corpora
TLDR
A comparison among three systems that estimate semantic similarity between words shows that for large corpora PMI works best on word similarity tests, and GLSA on synonymy tests, while for the smaller TASA corpus, GLSA produced the best performance on most tests.
Synonym Measurement Through Semantic Similarity Using the SOC-PMI Method
Abstract: Measurement of synonyms can be an important task in measuring word similarity. This work cannot be done syntactically, but must dig deeper about its semantics. Semantic relations can be…
Using Filtered Second Order Co-occurrence Matrix to Improve the Traditional Co-occurrence Model
Using co-occurrence statistics to measure word similarities/relatedness has applications in many areas of natural language processing. Our experiment results also indicate that two words with zero…
Spreading semantic information by Word Sense Disambiguation
TLDR
An unsupervised approach to solve semantic ambiguity based on the integration of the Personalized PageRank algorithm with word-sense frequency information is presented, which includes semantic information that obtains the appropriate word sense via support from two sources.
An Integrated Approach to Measuring Semantic Similarity between Words Using Information Available on the Web
TLDR
Experimental results show that the proposed semantic similarity measure outperforms all the existing web based semantic similarity measures by a wide margin, and significantly improves the accuracy in a named entity clustering task, proving the capability of the proposed measure to capture semantic similarity using web content.
A Combined Method to Measure the Semantic Similarity between Words
Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. This method…

References

SHOWING 1-10 OF 37 REFERENCES
Book Reviews: EuroWordNet: A Multilingual Database with Lexical Semantic Networks
WordNet, the on-line English thesaurus and lexical database developed at Princeton University by George Miller and his colleagues (Fellbaum 1998), has proved to be an extremely important resource…
Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the…
Finding Semantic Similarity in Raw Text: the Deese Antonyms
TLDR
Using statistical methods combined with robust syntactic analysis, SEXTANT was able to find many of the intuitive pairings between semantically similar words studied by Deese [Deese, 1954].
Using Statistics in Lexical Analysis
TLDR
The computational tools available for studying machine-readable corpora are at present still rather primitive and use these corpora and the basic concordancing tool mentioned above to fill in detailed syntactic descriptions (prompting a move towards more thorough descriptions of lexical syntax).
WordNet : an electronic lexical database
TLDR
The lexical database: nouns in WordNet, Katherine J. Miller a semantic network of English verbs, and applications of WordNet: building semantic concordances are presented.
Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words
TLDR
This paper proposes the use of WordNet as a knowledge base in an information retrieval task and proposes a semantic similarity measure which can be used as an alternative to pattern matching in the comparison process.
Computational Methods for Intelligent Information Access
TLDR
A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented, with a promising way to improve users’ access to many kinds of textual materials.
Automatic Retrieval and Clustering of Similar Words
TLDR
A word similarity measure based on the distributional pattern of words allows the automatically constructed thesaurus to be significantly closer to WordNet than Roget Thesaurus is.
Word Association Norms, Mutual Information and Lexicography
TLDR
The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.
Automatic Query Expansion Using SMART: TREC 3
TLDR
This work continues the work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments, with a major focus on massive query expansion, adding from 300 to 530 terms to each query.