Understanding inverse document frequency: on theoretical arguments for IDF

  @article{robertson2004idf,
    title={Understanding inverse document frequency: on theoretical arguments for IDF},
    author={Stephen E. Robertson},
    journal={J. Documentation},
    year={2004}
  }
  • Published 1 October 2004
  • Computer Science
The term‐weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in… 
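For reference, the term-weighting function the abstract describes can be sketched in a few lines. This is a minimal illustration using the standard log(N/df) definition of IDF with raw term counts as TF; the toy corpus is invented for the example:

```python
import math
from collections import Counter

# Toy corpus; documents are pre-tokenised for simplicity.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Classical IDF: log(N / df_t)."""
    return math.log(N / df[term])

def tf_idf(term, doc):
    """TF*IDF: raw count of the term in the document times its IDF."""
    return doc.count(term) * idf(term)
```

Terms occurring in every document (like "the" above) receive zero weight, while rarer terms are boosted; this is the heuristic whose theoretical justifications the paper examines.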


A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation)

This paper provides a simple probabilistic explanation for the tf-idf heuristic and shows that the ideas behind this explanation can help produce more complex formulas, which will hopefully lead to a more adequate detection of keywords.

Why Language Models and Inverse Document Frequency for Information Retrieval?

This work reviews why statistical language models hold the same information content as TF.IDF, investigating the foundations of the two models through an information-theoretic framework that uses entropy formulas to derive TF.IDF mathematically.

Comparative Analysis of IDF Methods to Determine Word Relevance in Web Document

Different derivations of inverse document frequency for measuring the weight of terms are discussed and compared, including the most famous derivation, which follows from the Robertson-Sparck Jones relevance weight.

Exploring the Stability of IDF Term Weighting

This study investigated the similarities and differences between IDF distributions based on the global collection and on different samples and tested the stability of the IDF measure across collections.

Relevance information: a loss of entropy but a gain for IDF?

The main result is a formal framework uncovering the close relationship between a generalised idf and the BIR model; a Poisson-based idf is shown to be superior to the classical idf, with the superiority particularly evident for long queries.

Alternatives to Classic BM25-IDF based on a New Information Theoretical Framework

  • W. Ke
  • Computer Science
    2022 IEEE International Conference on Big Data (Big Data)
  • 2022
A new information metric called DLITE is developed, and from it an alternative to IDF, namely iDL, is derived for term weighting and scoring in ranked information retrieval; it is expected to be applicable in many other areas of big-data analytics and machine learning, where further research will be valuable.

A hypergeometric test interpretation of a common tf-idf variant

It is shown that the hypergeometric test from classical statistics corresponds well with a common tf-idf variant on selected real-data information retrieval tasks, and a mathematical argument is set forth suggesting that the tf-idf variant functions as an approximation to the hypergeometric test (and vice versa).
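The test in question can be sketched concretely. Under a hypergeometric model, a document of length n is treated as a sample without replacement from a corpus of M tokens containing K occurrences of a term; a small tail probability P(X >= k) means the term occurs in the document more often than chance predicts, much as a high tf-idf score would indicate. A minimal sketch (the model framing follows the summary above; all parameter values are invented):

```python
from math import comb

def hypergeom_sf(k, M, K, n):
    """Tail probability P(X >= k) for X ~ Hypergeometric(M, K, n):
    drawing n tokens from a corpus of M tokens, K of which are
    occurrences of the term of interest."""
    total = comb(M, n)
    return sum(comb(K, i) * comb(M - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# A keyword score analogous to tf-idf: the smaller the tail
# probability, the larger -log P, the more surprising the term.
```

Summing the full support recovers probability 1, which is a quick sanity check on the implementation.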

Seminar: Techniques for Implementing Main Memory Databases. Text Analysis: TF-IDF

This work deals with a new approach to TF-IDF that modifies not the algorithm itself but rather the target document, and shows a way to apply TF-IDF to information retrieval by proposing a document ranking formula.

A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf

A supervised variant of the tf.idf scheme is proposed, based on computing the usual idf factor without considering documents of the category to be recognized, so that the importance of terms appearing only within that category is not underestimated.

Deriving TF-IDF as a Fisher Kernel

It is shown that the DCM Fisher kernel has components that are similar to the term frequency (TF) and inverse document frequency (IDF) factors of the standard TF-IDF method for representing documents.



Why Inverse Document Frequency?

It is shown that IDF is the optimal weight associated with a word-feature in an information retrieval setting where each document is treated as the query that retrieves itself; that is, IDF is optimal for document self-retrieval.

An information-theoretic perspective of tf-idf measures

Inverse Document Frequency (IDF): A Measure of Deviations from Poisson

Inverse document frequency (IDF), a quantity borrowed from Information Retrieval, is used to distinguish words like somewhat and boycott; boycott is the better keyword because its IDF is farther from what would be expected by chance (Poisson).
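The deviation-from-Poisson idea can be sketched as follows. If a term's cf total occurrences were scattered at random (Poisson) across N documents, its expected document frequency would be N(1 - e^(-cf/N)); a bursty keyword like boycott has an observed df well below this, so its observed IDF exceeds the Poisson prediction. A minimal sketch with invented counts:

```python
import math

def poisson_predicted_df(cf, N):
    """Expected document frequency if cf occurrences were scattered
    over N documents by a Poisson process with rate cf/N per document."""
    lam = cf / N
    return N * (1.0 - math.exp(-lam))

def idf_deviation(df, cf, N):
    """Observed IDF minus Poisson-predicted IDF (base-2 logs).
    Large positive values flag bursty, keyword-like terms."""
    observed = -math.log2(df / N)
    predicted = -math.log2(poisson_predicted_df(cf, N) / N)
    return observed - predicted
```

With cf = 100 occurrences over N = 1000 documents, the Poisson model predicts a df of about 95; a term concentrated in only 20 documents therefore has an IDF well above chance, while a term spread across roughly 95 documents deviates hardly at all.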

A frequency-based and a poisson-based definition of the probability of being informative

It is shown that an intuitive idf-based function for the probability of a term being informative assumes disjoint document events; by assuming documents to be independent rather than disjoint, the framework becomes useful for understanding and deciding the parameter estimation and combination in probabilistic retrieval models.

The probability ranking principle in IR

It is shown that the principle that documents should be ranked in order of the probability of relevance or usefulness can be justified under certain assumptions, but that in cases where these assumptions do not hold, the principle is not valid.

Relevance-Based Language Models

This work proposes a novel technique for estimating a relevance model with no training data and demonstrates that it can produce highly accurate relevance models, addressing important notions of synonymy and polysemy.

A statistical interpretation of term specificity and its application in retrieval

It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.

Relevance weighting of search terms

This paper examines statistical techniques for exploiting relevance information to weight search terms using information about the distribution of index terms in documents in general and shows that specific weighted search methods are implied by a general probabilistic theory of retrieval.

Improving the suitability of imperfect transcriptions for information retrieval from spoken documents

  • M. Siegler, M. Witbrock
  • Computer Science
    1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258)
  • 1999
A method for measuring the relevance of documents to queries when information about the probability of word transcription error is available is described, and a method is presented for estimating word error probability in speech recognition engines that use word graphs (lattices).

Using Probabilistic Models of Document Retrieval without Relevance Information

Based on a probabilistic model, this paper considers the situation where no relevance information is available, that is, at the start of the search, and proposes strategies for the initial search and an intermediate search.