Understanding inverse document frequency: on theoretical arguments for IDF

@article{Robertson2004UnderstandingID,
  title={Understanding inverse document frequency: on theoretical arguments for IDF},
  author={Stephen E. Robertson},
  journal={Journal of Documentation},
  year={2004},
  volume={60},
  pages={503--520}
}
  • Published 1 October 2004
  • Computer Science
The term‐weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in… 
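The TF*IDF weighting described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own formulation: the function names and toy collection are invented for the example, natural log is used, and the smoothed variants discussed in the surveyed papers (e.g. the Robertson–Spärck Jones relevance weight with its 0.5 corrections) are deliberately omitted.

```python
import math

def idf(term, docs):
    """Classic IDF: log(N / n_t), where N is the collection size
    and n_t is the number of documents containing the term."""
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t) if n_t else 0.0

def tf_idf(term, doc, docs):
    """TF*IDF using raw within-document term counts as TF."""
    return doc.count(term) * idf(term, docs)

# Toy collection: each document is a list of tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]
print(idf("the", docs))  # 0.0 -- a term in every document carries no weight
print(idf("cat", docs))  # log(3/2), approximately 0.405
```

Note how a term occurring in every document gets zero weight, which is exactly the specificity intuition that the theoretical justifications reviewed in the paper try to ground.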

A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation)
TLDR
This paper provides a simple probabilistic explanation for the tf-idf heuristic and shows that the ideas behind this explanation can help in deriving more complex formulas that may lead to more adequate keyword detection.
Why Language Models and Inverse Document Frequency for Information Retrieval ?
TLDR
This paper reviews why statistical language models hold the same information content as TF.IDF, investigating the foundations of the two models through an information-theoretic framework of entropy formulas and mathematical derivation.
Comparative Analysis of IDF Methods to Determine Word Relevance in Web Document
TLDR
Different derivations of inverse document frequency for measuring term weight are discussed and compared, the best-known of which follow from the Robertson–Spärck Jones relevance weight.
Exploring the Stability of IDF Term Weighting
TLDR
This study investigated the similarities and differences between IDF distributions based on the global collection and on different samples and tested the stability of the IDF measure across collections.
Relevance information: a loss of entropy but a gain for IDF?
TLDR
The main result is a formal framework uncovering the close relationship of a generalised idf and the BIR model, and a Poisson-based idf is superior to the classical idf, where the superiority is particularly evident for long queries.
A hypergeometric test interpretation of a common tf-idf variant
TLDR
It is shown that the hypergeometric test from classical statistics corresponds well with a common tf-idf variant on selected real-data information retrieval tasks, and a mathematical argument is set forth suggesting that the tf-idf variant functions as an approximation to the hypergeometric test (and vice versa).
Seminar : Techniques for implementing main memory data bases Text analysis : TF-IDF
TLDR
This work deals with a new approach to TF-IDF that modifies not the algorithm itself but the target document, and shows a way to apply TF-IDF to information retrieval by proposing a document ranking formula.
A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf
TLDR
A supervised variant of the tf.idf scheme is proposed, based on computing the usual idf factor without considering documents of the category to be recognized, so that importance of terms appearing only within it is not underestimated.
Deriving TF-IDF as a Fisher Kernel
TLDR
It is shown that the DCM Fisher kernel has components that are similar to the term frequency (TF) and inverse document frequency (IDF) factors of the standard TF-IDF method for representing documents.
IDF revisited: a simple new derivation within the Robertson-Spärck Jones probabilistic model
TLDR
This work shows that a more intuitively plausible assumption suffices to justify the effectiveness of the inverse document frequency and provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).

References

SHOWING 1-10 OF 36 REFERENCES
Why Inverse Document Frequency?
TLDR
It is shown that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where the authors treat each document as the query that retrieves itself, which means IDF is optimal for document self-retrieval.
An information-theoretic perspective of tf-idf measures
Inverse Document Frequency (IDF): A Measure of Deviations from Poisson
TLDR
Inverse document frequency (IDF), a quantity borrowed from Information Retrieval, is used to distinguish words like somewhat and boycott; boycott is the better keyword because its IDF is farther from what would be expected by chance (Poisson).
A frequency-based and a poisson-based definition of the probability of being informative
TLDR
It is shown that an intuitive idf-based probability function for the probability of a term being informative assumes disjoint document events; by assuming documents to be independent rather than disjoint, the framework becomes useful for understanding and deciding parameter estimation and combination in probabilistic retrieval models.
The probability ranking principle in IR
TLDR
It is shown that the principle that documents should be ranked in order of the probability of relevance or usefulness can be justified under certain assumptions, but that in cases where these assumptions do not hold, the principle is not valid.
A statistical interpretation of term specificity and its application in retrieval
TLDR
It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.
Relevance weighting of search terms
TLDR
This paper examines statistical techniques for exploiting relevance information to weight search terms using information about the distribution of index terms in documents in general and shows that specific weighted search methods are implied by a general probabilistic theory of retrieval.
Improving the suitability of imperfect transcriptions for information retrieval from spoken documents
  • M. Siegler, M. Witbrock
  • Computer Science
    1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258)
  • 1999
TLDR
A method for measuring the relevance of documents to queries when information about the probability of word transcription error is available is described, and a method is presented for estimating word error probability in speech recognition engines that use word graphs (lattices).
Using Probabilistic Models of Document Retrieval without Relevance Information
TLDR
This paper considers the situation where no relevance information is available (that is, at the start of the search) and, based on a probabilistic model, proposes strategies for the initial search and an intermediate search.
A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing
TLDR
An algorithm defining a measure of indexability is developed, a measure intended to reflect the relative significance of words in documents; it is found to consistently produce indexes superior to those produced by another measure previously identified in the literature as giving the best results.