# IDF revisited: a simple new derivation within the Robertson-Spärck Jones probabilistic model

@inproceedings{Lee2007IDFRA, title={IDF revisited: a simple new derivation within the Robertson-Sp{\"a}rck Jones probabilistic model}, author={Lillian Lee}, booktitle={SIGIR}, year={2007} }

There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Spärck Jones's probabilistic model are based on strong or complex assumptions. We show that a more intuitively plausible assumption suffices. Moreover, the new assumption, while conceptually very simple, provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).
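As background for the abstract, the following minimal sketch shows how an IDF-like weight falls out of the Robertson-Spärck Jones relevance weight when no relevance information is available. It is not the paper's new derivation: the constant `p = 0.5` is the kind of strong simplifying assumption used in earlier work, and the function names and the estimate `q = n / N` are illustrative choices made here.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Classical IDF: log(N / n), where n is the term's document frequency."""
    return math.log(n_docs / doc_freq)


def rsj_weight(p: float, q: float) -> float:
    """Robertson-Sparck Jones relevance weight (a log odds ratio).

    p: probability that the term occurs in a relevant document
    q: probability that the term occurs in a non-relevant document
    """
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


def rsj_without_relevance_info(n_docs: int, doc_freq: int, p: float = 0.5) -> float:
    """RSJ weight under two illustrative assumptions (not the paper's):

    - p is the same constant for every query term (default 0.5);
    - q is estimated from the whole collection as n / N.

    With p = 0.5 this reduces to log((N - n) / n), which is close to the
    classical IDF whenever n is much smaller than N.
    """
    q = doc_freq / n_docs
    return rsj_weight(p, q)


if __name__ == "__main__":
    N = 1_000_000  # hypothetical collection size
    for n in (10, 1_000, 100_000):
        print(n, round(idf(N, n), 3), round(rsj_without_relevance_info(N, n), 3))
```

For rare terms the two columns printed by the script nearly coincide, which is why a fixed `p` yields an IDF-like weight. The difficulty the abstract alludes to is choosing a well-motivated `p` that is not simply held constant.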

#### 11 Citations

Generalized inverse document frequency

- Computer Science
- CIKM '08
- 2008

A new, more generalized form of IDF is derived that is based on the Robertson-Spärck Jones relevance weight, and it is shown that generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.

Efficient and Effective Higher Order Proximity Modeling

- Computer Science
- ICTIR
- 2016

This work provides further evidence that term-dependency features not captured by bag-of-words models can reliably improve retrieval effectiveness, and presents a new variation on the highly-effective MRF model that relies on a BM25-derived potential.

Automatic Term Reweighting for Query Expansion

- Computer Science
- ADCS
- 2017

This work found that reweighting through term frequency merging is more effective than standard query expansion, because it reduces the impact of spurious expansion terms being overrepresented in the modified query.

Interpreting TF-IDF term weights as making relevance decisions

- Computer Science
- TOIS
- 2008

A novel probabilistic retrieval model forms a basis to interpret the TF-IDF term weights as making relevance decisions, and it is shown that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems.

CS 674/INFO 630: Advanced Language Technologies

- 2007

$P_{\vec{\theta}} : V \mapsto [0, 1]$, where $\vec{\theta}$ is an element of the $m$-dimensional probability simplex. Hence the probability assigned to a single term $v_j$ is defined as $P_{\vec{\theta}}(v_j) \stackrel{\text{def}}{=} \theta[j]$. Also recall from the…

Efficient and effective retrieval using Higher-Order proximity models

- Computer Science
- 2017

Information retrieval systems that leverage proximity heuristics to estimate the relevance of a document have been shown to be effective; however, their computational cost is high.

Improvements to BM25 and Language Models Examined

- Computer Science
- ADCS '14
- 2014

This investigation finds that, once trained (using particle swarm optimization), there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.

Scalable Text Mining with Sparse Generative Models

- Computer Science
- ArXiv
- 2016

A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments, and reduces the computational complexity of the common text mining operations according to sparsity.

Combining Modifications to Multinomial Naive Bayes for Text Classification

- Computer Science
- AIRS
- 2012

The optimized combination of popular modifications to generative models in the context of MNB text classification results in over 20% mean reduction in classification errors compared to baseline MNB models, reducing the gap between SVM and MNB mean performance by over 60%.

Notice of Retraction: Empirical study of IDF on text classification dataset

- Mathematics
- 2010 3rd International Conference on Computer Science and Information Technology
- 2010

This paper observes and analyses IDF and its properties on the best TC dataset. We check the Zipf law for the occurring frequency (OF) and document frequency (DF) of features. And we pay much attention…

#### References

Understanding inverse document frequency: on theoretical arguments for IDF

- Mathematics, Computer Science
- J. Documentation
- 2004

It is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.

Relevance information: a loss of entropy but a gain for IDF?

- Computer Science
- SIGIR '05
- 2005

The main result is a formal framework uncovering the close relationship of a generalised idf and the BIR model; a Poisson-based idf is shown to be superior to the classical idf, and the superiority is particularly evident for long queries.

Inverse Document Frequency (IDF): A Measure of Deviations from Poisson

- Computer Science
- VLC@ACL
- 1995

Inverse document frequency (IDF), a quantity borrowed from Information Retrieval, is used to distinguish words like somewhat and boycott; boycott is a better keyword because its IDF is farther from what would be expected by chance (Poisson).

A Note on Inverse Document Frequency Weighting Scheme

- Mathematics
- 1989

Based on the Shannon information theory, a measure for term value is introduced. This study is an attempt to provide a theoretical justification for the inverse document frequency (IDF) weighting…

Why Inverse Document Frequency?

- Computer Science
- NAACL
- 2001

It is shown that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where the authors treat each document as the query that retrieves itself, which means IDF is optimal for document self-retrieval.

Using Probabilistic Models of Document Retrieval without Relevance Information

- Computer Science
- J. Documentation
- 1979

This paper considers, on the basis of a probabilistic model, the situation where no relevance information is available, that is, at the start of the search, and proposes strategies for the initial search and an intermediate search.

A theory of term weighting based on exploratory data analysis

- Computer Science
- SIGIR '98
- 1998

It is argued that exploratory data analysis can be a valuable tool for research whose goal is the development of an explanatory theory of information retrieval.

A formal study of information retrieval heuristics

- Computer Science
- SIGIR '04
- 2004

A formal study of retrieval heuristics is presented and it is found that the empirical performance of a retrieval formula is tightly related to how well it satisfies basic desirable constraints.

Relevance weighting of search terms

- Computer Science
- J. Am. Soc. Inf. Sci.
- 1976

This paper examines statistical techniques for exploiting relevance information to weight search terms using information about the distribution of index terms in documents in general and shows that specific weighted search methods are implied by a general probabilistic theory of retrieval.

A statistical interpretation of term specificity and its application in retrieval

- Mathematics, Computer Science
- J. Documentation
- 2004

It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.