IDF revisited: a simple new derivation within the Robertson-Spärck Jones probabilistic model

@inproceedings{Lee2007IDFRA,
  title={IDF revisited: a simple new derivation within the Robertson-Sp{\"a}rck Jones probabilistic model},
  author={Lillian Lee},
  booktitle={SIGIR},
  year={2007}
}
  • Lillian Lee
  • Published in SIGIR 8 May 2007
  • Computer Science
There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Spärck Jones's probabilistic model are based on strong or complex assumptions. We show that a more intuitively plausible assumption suffices. Moreover, the new assumption, while conceptually very simple, provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).
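As background for the abstract above, the following is a minimal Python sketch of the classical IDF and of the Robertson-Spärck Jones (RSJ) term weight in the setting where no relevance information is available. It is not the paper's new derivation. It illustrates the older, well-known route in which the probability that a term occurs in a relevant document is fixed at 0.5 and the probability that it occurs in a non-relevant document is estimated from the whole collection. The function names, the 0.5 smoothing constants, and the example counts are assumptions chosen for illustration.

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Classical IDF, log(N / n), for a term occurring in n of N documents."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must satisfy 0 < doc_freq <= num_docs")
    return math.log(num_docs / doc_freq)


def rsj_weight(p: float, q: float) -> float:
    """RSJ term weight, log[p(1 - q) / (q(1 - p))].

    p: probability that the term occurs in a relevant document.
    q: probability that the term occurs in a non-relevant document.
    """
    return math.log(p * (1.0 - q) / (q * (1.0 - p)))


def rsj_idf_no_relevance(num_docs: int, doc_freq: int) -> float:
    """RSJ weight under illustrative assumptions: p = 0.5 and a smoothed q.

    With q = (n + 0.5) / (N + 1), the weight reduces to
    log((N - n + 0.5) / (n + 0.5)).
    """
    q = (doc_freq + 0.5) / (num_docs + 1.0)
    return rsj_weight(0.5, q)


if __name__ == "__main__":
    # Hypothetical collection of 1000 documents.
    for n in (10, 100, 600):
        print(n, classic_idf(1000, n), rsj_idf_no_relevance(1000, n))
```

Under these assumptions the RSJ-derived weight becomes negative once a term appears in more than half of the collection, whereas the classical log(N / n) form never does.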
Generalized inverse document frequency
TLDR
A new, more generalized form of IDF is derived that is based on the Robertson-Spärck Jones relevance weight, and it is shown that generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
Efficient and Effective Higher Order Proximity Modeling
TLDR
This work provides further evidence that term-dependency features not captured by bag-of-words models can reliably improve retrieval effectiveness, and presents a new variation on the highly-effective MRF model that relies on a BM25-derived potential.
Automatic Term Reweighting for Query Expansion
TLDR
This work found that reweighting through term frequency merging is more effective than standard query expansion, which reduces the impact of spurious expansion terms being over represented in the modified query.
Interpreting TF-IDF term weights as making relevance decisions
TLDR
A novel probabilistic retrieval model forms a basis to interpret the TF-IDF term weights as making relevance decisions, and it is shown that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems.
CS 674/INFO 630: Advanced Language Technologies
$P_{\vec{\theta}} : V \mapsto [0, 1]$, where $\vec{\theta}$ is an element of the $m$-dimensional probability simplex. Hence the probability assigned to a single term $v_j$ is defined as: $P_{\vec{\theta}}(v_j) \stackrel{\text{def}}{=} \theta[j]$. Also recall from the ...
Efficient and effective retrieval using Higher-Order proximity models
  • X. Lu
  • Computer Science
  • 2017
TLDR
Information Retrieval systems leveraging proximity heuristics to estimate the relevance of a document have been shown to be effective; however, the computational cost is high.
Improvements to BM25 and Language Models Examined
TLDR
This investigation finds that once trained (using particle swarm optimization) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective and that it remains unclear which function is best over-all.
Scalable Text Mining with Sparse Generative Models
TLDR
A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments, and reduces the computational complexity of the common text mining operations according to sparsity.
Combining Modifications to Multinomial Naive Bayes for Text Classification
TLDR
The optimized combination of popular modifications to generative models in the context of MNB text classification results in over 20% mean reduction in classification errors compared to baseline MNB models, reducing the gap between SVM and MNB mean performance by over 60%.
Notice of Retraction: Empirical study of IDF on text classification dataset
  • Ziqiang Li, M. Zhou
  • Mathematics
  • 2010 3rd International Conference on Computer Science and Information Technology
  • 2010
This paper observes and analyses IDF and its properties on the best TC dataset. We check the Zipf law of occurring frequency (OF) and document frequency (DF) of features. And we pay much attention ...

References

SHOWING 1-10 OF 17 REFERENCES
Understanding inverse document frequency: on theoretical arguments for IDF
  • S. Robertson
  • Mathematics, Computer Science
  • J. Documentation
  • 2004
TLDR
It is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
Relevance information: a loss of entropy but a gain for IDF?
TLDR
The main result is a formal framework uncovering the close relationship of a generalised idf and the BIR model, and a Poisson-based idf is superior to the classical idf, where the superiority is particularly evident for long queries.
Inverse Document Frequency (IDF): A Measure of Deviations from Poisson
TLDR
Inverse document frequency (IDF), a quantity borrowed from Information Retrieval, is used to distinguish words like somewhat and boycott; boycott is a better keyword because its IDF is farther from what would be expected by chance (Poisson).
A Note on Inverse Document Frequency Weighting Scheme
Based on the Shannon information theory, a measure for term value is introduced. This study is an attempt to provide a theoretical justification for the inverse document frequency (IDF) weighting ...
Why Inverse Document Frequency?
TLDR
It is shown that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where the authors treat each document as the query that retrieves itself, which means IDF is optimal for document self-retrieval.
Using Probabilistic Models of Document Retrieval without Relevance Information
TLDR
This paper considers the situation where no relevance information is available, that is, at the start of the search, based on a probabilistic model, and proposes strategies for the initial search and an intermediate search.
A theory of term weighting based on exploratory data analysis
TLDR
It is argued that exploratory data analysis can be a valuable tool for research whose goal is the development of an explanatory theory of information retrieval.
A formal study of information retrieval heuristics
TLDR
A formal study of retrieval heuristics is presented and it is found that the empirical performance of a retrieval formula is tightly related to how well it satisfies basic desirable constraints.
Relevance weighting of search terms
TLDR
This paper examines statistical techniques for exploiting relevance information to weight search terms using information about the distribution of index terms in documents in general and shows that specific weighted search methods are implied by a general probabilistic theory of retrieval.
A statistical interpretation of term specificity and its application in retrieval
  • K. Spärck Jones
  • Mathematics, Computer Science
  • J. Documentation
  • 2004
TLDR
It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.