On the resemblance and containment of documents

@article{Broder1997OnTR,
  title={On the resemblance and containment of documents},
  author={Andrei Z. Broder},
  journal={Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)},
  year={1997},
  pages={21-29}
}
  • A. Broder
  • Published 11 June 1997
  • Computer Science
  • Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper… 
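
As a concrete illustration of this reduction (a minimal sketch, not Broder's exact construction: the shingle size, hash function, and sample size s below are arbitrary choices), a document can be turned into a set of word shingles, resemblance can be estimated from a fixed-size sample of the smallest hash values, and containment can be computed directly from the shingle sets:

```python
import hashlib
from typing import Set

def shingles(text: str, w: int = 4) -> Set[str]:
    """The set of contiguous word w-grams ("shingles") of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def _h(x: str) -> int:
    # Any well-mixed hash works for this illustration.
    return int.from_bytes(hashlib.sha1(x.encode()).digest()[:8], "big")

def sketch(shingle_set: Set[str], s: int = 200) -> Set[int]:
    """Fixed-size sample: the s smallest hash values of the shingle set."""
    return set(sorted(_h(x) for x in shingle_set)[:s])

def resemblance_estimate(sk_a: Set[int], sk_b: Set[int], s: int = 200) -> float:
    """Estimate r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| from the two samples:
    among the s smallest values of the combined sample, count those that are
    common to both sketches."""
    merged = set(sorted(sk_a | sk_b)[:s])
    return len(merged & sk_a & sk_b) / len(merged)

def containment(sa: Set[str], sb: Set[str]) -> float:
    """c(A, B) = |S(A) ∩ S(B)| / |S(A)|, computed exactly here; unlike
    resemblance it is not estimated from a fixed-size sample."""
    return len(sa & sb) / len(sa)

A = shingles("the quick brown fox jumps over the lazy dog " * 5)
B = shingles("the quick brown fox leaps over the lazy dog " * 5)
print(resemblance_estimate(sketch(A), sketch(B)), containment(A, B))
```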

Identifying and Filtering Near-Duplicate Documents

The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

It turns out that the new heuristic for inexact top K retrieval is not better than index elimination, and a combination of both heuristics yields the best results.

Distance Measures and Smoothing Methodology for Imputing Features of Documents

We suggest a new class of metrics for measuring distances between documents, generalizing the well-known resemblance distance. We then show how to combine these distance measures with statistical smoothing.

Estimating set intersection using small samples

By using more advanced estimation techniques, it is shown that one can significantly reduce sample sizes without compromising accuracy, or conversely, obtain more accurate results from the same samples.
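
As a hedged aside, the simplest baseline (not the paper's more advanced estimators) recovers the intersection size from a resemblance estimate plus the exact set sizes, using |A ∪ B| = |A| + |B| - |A ∩ B|:

```python
def intersection_from_jaccard(j_hat: float, size_a: int, size_b: int) -> float:
    """Recover |A ∩ B| from an estimated Jaccard resemblance j_hat and the
    exact set sizes: |A ∩ B| = j/(1 + j) * (|A| + |B|)."""
    return j_hat / (1.0 + j_hat) * (size_a + size_b)

# e.g. j_hat = 0.5 with |A| = |B| = 100 gives roughly 66.7 common elements
print(intersection_from_jaccard(0.5, 100, 100))
```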

The Similarity Index

This investigation was conducted with the intent of implementing a hash value for a document that captures its salient characteristics, such that a repository can be queried for like values and retrieve all “similar” documents.
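
A minimal sketch of the repository idea (the class name, threshold, and bucketing scheme are illustrative assumptions, not the paper's actual index): each document is registered under the values of its similarity hash or sketch, and a query returns every document sharing enough values with the query's sketch:

```python
from collections import defaultdict

class SketchRepository:
    """Toy index: each sketch value maps to the documents containing it."""

    def __init__(self, min_common: int = 1):
        self.min_common = min_common      # shared values needed to count as "similar"
        self.buckets = defaultdict(set)   # sketch value -> set of doc ids

    def register(self, doc_id, sketch_values):
        for v in sketch_values:
            self.buckets[v].add(doc_id)

    def query(self, sketch_values):
        counts = defaultdict(int)
        for v in sketch_values:
            for doc_id in self.buckets[v]:
                counts[doc_id] += 1
        return {d for d, c in counts.items() if c >= self.min_common}

repo = SketchRepository(min_common=2)
repo.register("doc1", {11, 22, 33, 44})
repo.register("doc2", {33, 44, 55, 66})
print(repo.query({22, 33, 44}))  # both documents share at least two values
```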

Detecting Short Passages of Similar Text in Large Document Collections

The method exploits the characteristic distribution of word trigrams; the measures used to determine similarity are based on set-theoretic principles, and the approach has been successfully used to detect plagiarism in students’ work.
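
A simple sketch of the trigram idea (a generic overlap check under an assumed tokenization, not the paper's exact measures): the fraction of a passage's word trigrams that also occur in a source text is a set-theoretic signal that the passage may be copied:

```python
def word_trigrams(text: str) -> set:
    """All contiguous word trigrams of a text, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def trigram_containment(passage: str, source: str) -> float:
    """Fraction of the passage's trigrams also present in the source;
    values close to 1 suggest the passage was lifted from the source."""
    p, s = word_trigrams(passage), word_trigrams(source)
    return len(p & s) / len(p) if p else 0.0
```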

Approximate Structural Consistency

An approximate algorithm is described that decides whether an input document I is close to a target regular schema (DTD); this property is testable, i.e., it can be decided in time independent of the size of the input document, by just sampling I.

Syntactic similarity of Web documents

  • Álvaro R. Pereira, N. Ziviani
  • Computer Science
    Proceedings of the First Latin American Web Congress (LA-WEB 2003)
  • 2003
Two methods for evaluating the syntactic similarity between documents are presented and compared: given an original document and a set of candidates, both methods find the documents that have some similarity relationship with the original.

A Scalable System for Identifying Co-derivative Documents

Spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection, is presented, and deco, a prototype system that makes use of spex, is described.
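
The following is only a naive, single-pass illustration of hash-based duplicated-chunk extraction (spex itself uses an iterative, memory-efficient scheme; the chunk length and the use of Python's built-in hash are assumptions): hash every fixed-length word window, record which documents it occurs in, and keep the chunks shared by more than one document:

```python
from collections import defaultdict

def duplicated_chunks(docs: dict, chunk_words: int = 8) -> dict:
    """Map each chunk hash to the ids of the documents containing that chunk,
    keeping only chunks that appear in at least two documents."""
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(len(words) - chunk_words + 1):
            seen[hash(tuple(words[i:i + chunk_words]))].add(doc_id)
    return {h: ids for h, ids in seen.items() if len(ids) > 1}
```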

The power of two min-hashes for similarity search among hierarchical data objects

This study looks at data objects that are represented as leaf-labeled trees, denoting a set of elements at the leaves organized in a hierarchy, and computes sketches of such trees by propagating min-hash computations up the tree using locality-sensitive hashing.
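
As a much-simplified illustration of propagating min-hash computations up a tree (the paper's sketches use many hash functions and account for the hierarchy itself; the single hash below is an assumption): a leaf contributes the hash of its label, and an internal node takes the minimum over its children, so the root's value is the min-hash of the whole leaf set:

```python
import hashlib

def leaf_hash(label: str) -> int:
    # Any well-mixed hash of the leaf label; illustrative only.
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")

def tree_minhash(node) -> int:
    """One min-hash value of the leaf set under `node`, computed bottom-up.
    A node is either a str (a leaf label) or a list of child nodes."""
    if isinstance(node, str):
        return leaf_hash(node)
    return min(tree_minhash(child) for child in node)

# Two hierarchies sharing most of their leaves often agree on the minimum.
t1 = [["a", "b"], ["c", ["d", "e"]]]
t2 = [["a", "b"], ["c", ["d", "f"]]]
print(tree_minhash(t1) == tree_minhash(t2))
```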
...

References

SHOWING 1-10 OF 11 REFERENCES

SCAM: A Copy Detection Mechanism for Digital Documents

A new scheme for detecting copies is proposed, based on comparing the word frequency occurrences of a new document against those of registered documents, and an experimental comparison is reported between this scheme and COPS, a detection scheme based on sentence overlap.
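
For flavor only, a generic word-frequency comparison is sketched below (cosine similarity of term-frequency vectors; SCAM's actual relative frequency model is different and is not reproduced here):

```python
from collections import Counter
from math import sqrt

def freq_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of the two documents' word-frequency vectors."""
    fa, fb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    na = sqrt(sum(v * v for v in fa.values()))
    nb = sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na and nb else 0.0
```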

Copy detection mechanisms for digital documents

This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).

Min-wise independent permutations (extended abstract)

This research was motivated by the fact that such a family of permutations is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents.
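
The property being relied on can be demonstrated directly (a salted built-in hash stands in for a truly min-wise independent family here, which is an approximation): for a min-wise independent permutation π, Pr[min π(A) = min π(B)] equals the resemblance of A and B, so averaging agreement over many random permutations estimates it:

```python
import random

def minhash_agreement(set_a, set_b, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random (salted) hash functions under which A and B share
    the same minimum element; this fraction approaches |A ∩ B| / |A ∪ B|."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        salt = rng.getrandbits(64)
        if min(hash((salt, x)) for x in set_a) == min(hash((salt, x)) for x in set_b):
            agree += 1
    return agree / trials

print(minhash_agreement({1, 2, 3, 4}, {2, 3, 4, 5}))  # close to 3/5
```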

Building a scalable and accurate copy detection mechanism

This paper studies the performance of various copy detection mechanisms, including disk storage requirements, main memory requirements, response times for registration, and response times for querying, and contrasts performance with the accuracy of the mechanisms (how well they detect partial copies).

Syntactic Clustering of the Web

The Probabilistic Method

A particular set of problems - all dealing with “good” colorings of an underlying set of points relative to a given family of sets - is explored.

Finding Similar Files in a Large File System

Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.

Some applications of Rabin’s fingerprinting method

This paper presents an implementation and several applications of Rabin's fingerprinting scheme that take considerable advantage of its algebraic properties.
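
A minimal sketch of why such fingerprints are convenient (a polynomial rolling hash modulo a large prime is shown; Rabin's scheme actually works with random irreducible polynomials over GF(2), so this is only an analogue): sliding a window by one byte updates the fingerprint in constant time instead of rehashing the window from scratch:

```python
def rolling_fingerprints(data: bytes, window: int = 50,
                         base: int = 256, mod: int = (1 << 61) - 1) -> list:
    """Fingerprint of every window-byte substring of `data`, in one pass."""
    if len(data) < window:
        return []
    top = pow(base, window - 1, mod)          # weight of the byte leaving the window
    fp = 0
    for b in data[:window]:                   # fingerprint of the first window
        fp = (fp * base + b) % mod
    out = [fp]
    for i in range(window, len(data)):        # slide the window one byte at a time
        fp = ((fp - data[i - window] * top) * base + data[i]) % mod
        out.append(fp)
    return out

fps = rolling_fingerprints(b"abracadabra abracadabra", window=11)
print(fps[0] == fps[12])  # identical windows yield identical fingerprints
```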

Scalable Document Fingerprinting

  • Proceedings of the Second USENIX Workshop on Electronic Commerce
  • 1996

Some essential ideas behind the resemblance definition and computation were developed in conversations with Greg Nelson. The clustering of the entire Web was done in collaboration with Steve Glassman
