# On the resemblance and containment of documents

```bibtex
@article{Broder1997OnTR,
  title   = {On the resemblance and containment of documents},
  author  = {Andrei Z. Broder},
  journal = {Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)},
  year    = {1997},
  pages   = {21-29}
}
```

Given two documents A and B, we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B), which seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be evaluated easily by random sampling performed independently for each document. Furthermore, the resemblance can be evaluated using a fixed-size sample for each document. This paper…
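The set-intersection framing can be made concrete with a minimal sketch: break each document into contiguous word shingles and compute resemblance and containment directly as set ratios. The shingle width `w=3` is an illustrative choice, not a parameter fixed by the paper.

```python
# Minimal sketch of resemblance and containment as set-intersection
# problems over word shingles (shingle width w=3 is an assumption).

def shingles(text, w=3):
    """Return the set of contiguous w-word shingles of a document."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def containment(a, b):
    """c(A, B) = |S(A) & S(B)| / |S(A)|  (how much of A appears in B)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa)
```

Exact evaluation like this requires both shingle sets in full; the paper's contribution is that random sampling of the shingles estimates the same ratios with small, fixed-size samples.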

## 1,877 Citations

### Identifying and Filtering Near-Duplicate Documents

- Computer Science, CPM
- 2000

The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.

### Comparison of Standard and Zipf-Based Document Retrieval Heuristics

- Computer Science
- 2010

It turns out that the new heuristic for inexact top K retrieval is not better than index elimination, and a combination of both heuristics yields the best results.

### Distance Measures and Smoothing Methodology for Imputing Features of Documents

- Computer Science
- 2005

We suggest a new class of metrics for measuring distances between documents, generalizing the well-known resemblance distance. We then show how to combine distance measures with statistical smoothing…

### Estimating set intersection using small samples

- Mathematics, ACSC
- 2010

By using more advanced estimation techniques, it is shown that one can significantly reduce sample sizes without compromising accuracy, or conversely, obtain more accurate results from the same samples.

### The Similarity Index

- Computer Science
- 2011

This investigation was conducted with the intent of implementing a hash value for a document that captures its salient characteristics, such that a repository can be queried for like values and retrieve all “similar” documents.

### Detecting Short Passages of Similar Text in Large Document Collections

- Computer Science, EMNLP
- 2001

The method exploits the characteristic distribution of word trigrams; its similarity measures are based on set-theoretic principles, and it has been successfully used to detect plagiarism in students' work.

### Approximate Structural Consistency

- Computer Science, SOFSEM
- 2010

An approximate algorithm is described which decides if I is close to a target regular schema (DTD) and this property is testable, i.e. can be solved in time independent of the size of the input document, by just sampling I.

### Syntactic similarity of Web documents

- Computer Science
- 2003

Two methods for evaluating the syntactic similarity between documents are presented and compared: given an original document and a set of candidates, both methods find documents that have some similarity relationship with the original.

### A Scalable System for Identifying Co-derivative Documents

- Computer Science, SPIRE
- 2004

This paper presents Spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection, and describes Deco, a prototype system that makes use of Spex.

### The power of two min-hashes for similarity search among hierarchical data objects

- Computer Science, PODS
- 2008

This study looks at data objects represented as leaf-labeled trees, denoting a set of elements at the leaves organized in a hierarchy, and computes sketches of such trees by propagating min-hash computations up the tree using locality-sensitive hashing.
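The bottom-up propagation can be sketched in a few lines: each leaf contributes a min-hash sketch of its label, and an internal node's sketch is the coordinate-wise minimum over its children's sketches. The hash construction and sketch length `K=4` below are illustrative assumptions, not the paper's exact scheme.

```python
# Toy sketch of propagating min-hashes up a leaf-labeled tree.
# Assumptions: sketch length K=4 and SHA-256-based hash mixing are
# illustrative choices, not the construction from the PODS paper.
import hashlib

K = 4  # sketch length (assumption)

def leaf_sketch(label):
    """K independent hash values for a leaf label."""
    return [int(hashlib.sha256(f"{i}:{label}".encode()).hexdigest(), 16) % (1 << 32)
            for i in range(K)]

def node_sketch(child_sketches):
    """Coordinate-wise minimum over the children's sketches."""
    return [min(s[i] for s in child_sketches) for i in range(K)]

def tree_sketch(tree):
    """Trees are nested lists; strings are leaves."""
    if isinstance(tree, str):
        return leaf_sketch(tree)
    return node_sketch([tree_sketch(c) for c in tree])
```

Because `min` is associative, the root sketch depends only on the multiset of leaf labels reached, which is what makes the propagated sketch comparable across differently shaped hierarchies.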

## References

*Showing 1–10 of 11 references.*

### SCAM: A Copy Detection Mechanism for Digital Documents

- Computer Science, DL
- 1995

A new scheme for detecting copies based on comparing the word frequency occurrences of the new document against those of registered documents, and an experimental comparison between this scheme and COPS, a detection scheme based on sentence overlap is reported on.
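A word-frequency comparison of this kind can be sketched as follows. Note that SCAM's actual measure (its relative frequency model) differs; the cosine similarity of term-count vectors used here is a stand-in illustration of frequency-based, rather than shingle-based, comparison.

```python
# Illustrative word-frequency comparison (cosine similarity of term
# counts). This is NOT SCAM's relative frequency model, only a
# stand-in for the general frequency-based approach.
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    ca, cb = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0
```

Unlike shingle-based resemblance, a frequency-based measure ignores word order entirely, which is why the two families of detectors behave differently on partial copies.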

### Copy detection mechanisms for digital documents

- Computer Science, SIGMOD '95
- 1995

This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).

### Min-wise independent permutations (extended abstract)

- Computer Science, STOC '98
- 1998

This research was motivated by the fact that such a family of permutations is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents.
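The property that makes such permutations useful is that, for a random min-wise independent permutation π, Pr[min π(A) = min π(B)] equals the resemblance |A ∩ B| / |A ∪ B|, so agreement rates between min-hash signatures estimate resemblance. The sketch below uses random affine hashes modulo a Mersenne prime as a practical stand-in for truly min-wise independent permutations (the approximation this line of work analyzes); the signature length of 200 is an illustrative choice.

```python
# Estimating resemblance from min-hash signatures. Random affine
# hashes stand in for min-wise independent permutations (assumption);
# 200 hash functions is an illustrative signature length.
import random

P = (1 << 61) - 1  # a Mersenne prime for the hash family

def minhash_signature(items, hash_params):
    """One minimum per (a, b) hash function: min over a*x + b mod P."""
    return [min((a * x + b) % P for x in items) for a, b in hash_params]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of coordinates where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
hash_params = [(random.randrange(1, P), random.randrange(P)) for _ in range(200)]
A = set(range(100))
B = set(range(50, 150))  # true resemblance = 50/150 ~ 0.33
est = estimate_resemblance(minhash_signature(A, hash_params),
                           minhash_signature(B, hash_params))
```

With 200 independent hash functions the standard error of the estimate is roughly sqrt(r(1-r)/200), a few percentage points here.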

### Building a scalable and accurate copy detection mechanism

- Computer Science, DL '96
- 1996

This paper studies the performance of various copy detection mechanisms, including disk storage requirements, main memory requirements, and response times for registration and querying, and contrasts that performance with the accuracy of the mechanisms (how well they detect partial copies).

### The Probabilistic Method

- Computer Science, SODA
- 1992

A particular set of problems - all dealing with “good” colorings of an underlying set of points relative to a given family of sets - is explored.

### Finding Similar Files in a Large File System

- Computer Science, USENIX Winter
- 1994

Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.

### Some applications of Rabin’s fingerprinting method

- Computer Science, Mathematics
- 1993

This paper presents an implementation and several applications of Rabin's fingerprinting scheme that take considerable advantage of its algebraic properties.
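The key algebraic property can be illustrated with a simplified rolling fingerprint: polynomial hashing over a prime modulus rather than Rabin's irreducible polynomials over GF(2), but with the same consequence that the fingerprint of a sliding window can be updated in constant time.

```python
# Simplified rolling fingerprint in the spirit of Rabin's method.
# Assumption: polynomial hashing mod a prime stands in for Rabin's
# GF(2) polynomial arithmetic; the O(1) sliding-window update is the
# algebraic property being illustrated.
BASE = 256
MOD = (1 << 61) - 1

def fingerprint(data: bytes) -> int:
    """Horner-rule polynomial fingerprint of a byte string."""
    f = 0
    for byte in data:
        f = (f * BASE + byte) % MOD
    return f

def roll(f, out_byte, in_byte, window):
    """Slide a window of the given length right by one byte in O(1)."""
    drop = (out_byte * pow(BASE, window - 1, MOD)) % MOD
    return ((f - drop) * BASE + in_byte) % MOD
```

This constant-time update is what makes fingerprinting every w-byte window of a document feasible in a single linear pass.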

### Scalable Document Fingerprinting

- Proceedings of the Second USENIX Workshop on Electronic Commerce
- 1996

- Some essential ideas behind the resemblance definition and computation were developed in conversations with Greg Nelson. The clustering of the entire Web was done in collaboration with Steve Glassman.