• Corpus ID: 145027099

Methods for Determining the Similarity of Documents

Christian Olaf Häusler, A. M. Dreiling



When stopword lists make the difference

It is shown that implementing the original Okapi form, or certain models derived from the Divergence from Randomness (DFR) paradigm, may result in significantly lower performance levels when short or no stopword lists are used.

Probabilistic models of information retrieval based on measuring the divergence from randomness

A framework is introduced for deriving probabilistic models of Information Retrieval, using term-weighting models obtained in the language model approach by measuring the divergence of the actual term distribution from that obtained under a random process.

A language modeling approach to information retrieval

This work proposes an approach to retrieval based on probabilistic language modeling and integrates document indexing and document retrieval into a single model, which significantly outperforms standard tf.idf weighting on two different collections and query sets.

The Probabilistic Relevance Framework: BM25 and Beyond

This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F.

Information-based models for ad hoc IR

A long-standing hypothesis in IR, namely the fact that the difference in the behaviors of a word at the document and collection levels brings information on the significance of the word for the document, is shown to lead to simpler and better models.

An object oriented architecture

New ideas in this paper include the concept and implementation of abstract instructions, using floating point addresses to solve the small object problem, and a novel context allocation/access mechanism.

Learning Document Similarity Using Natural Language Processing

This paper addresses the problem of organizing documents into meaningful groups according to their content and of visualizing a text collection, providing an overview of the range of documents and of their relationships, so that they can be browsed more easily.

Overview of Stemming Algorithms for Indian and Non-Indian Languages

This paper discusses different stemming algorithms for Indian and non-Indian languages, covering methods of stemming and their accuracy and errors. Stemming is widely used in Information Retrieval systems and reduces the size of index files.

(Nov. 2005). “Information extraction: Distilling structured data from unstructured text”

  • Learning Document Similarity Using Natural Language Processing.
  • 2009