On the resemblance and containment of documents

@article{Broder1997OnTR,
  title={On the resemblance and containment of documents},
  author={A. Broder},
  journal={Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)},
  year={1997},
  pages={21--29}
}
  • A. Broder
  • Published 1997
  • Mathematics, History, Computer Science
  • Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
  • Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper…
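
The set-intersection view in the abstract can be illustrated with a minimal Python sketch. It assumes each document is reduced to a set of contiguous w-word sequences (shingles), that resemblance is the size of the intersection divided by the size of the union, and that containment is the size of the intersection divided by the size of the first set. The function names, the SHA-1 based hash, and the default sample size of 64 are illustrative choices, not details taken from this page.

```python
import hashlib


def shingles(text, w=3):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Exact resemblance of two shingle sets: |A n B| / |A u B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Exact containment of A in B: |A n B| / |A|."""
    return len(a & b) / len(a) if a else 1.0


def _hash(shingle):
    # Assumption: any well-mixed 64-bit hash would do; SHA-1 is used for brevity.
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(shingle_set, s=64):
    """Fixed-size sample of one document: its s smallest shingle hashes."""
    return sorted(_hash(x) for x in shingle_set)[:s]


def estimate_resemblance(sk_a, sk_b, s=64):
    """Estimate resemblance from two sketches, without the full documents."""
    set_a, set_b = set(sk_a), set(sk_b)
    # The s smallest hashes of the combined sketches act as a random sample of the union.
    smallest_of_union = set(sorted(set_a | set_b)[:s])
    if not smallest_of_union:
        return 1.0
    # Count the sampled hashes that occur in both documents.
    return len(smallest_of_union & set_a & set_b) / len(smallest_of_union)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps", w=2)
    b = shingles("the quick brown fox sleeps", w=2)
    print(resemblance(a, b))   # 3 shared shingles out of 5 distinct: 0.6
    print(containment(a, b))   # 3 of A's 4 shingles appear in B: 0.75
    print(estimate_resemblance(sketch(a), sketch(b)))
```

Each sketch is computed from its own document alone, which matches the abstract's point that sampling can be done independently per document. When a document has fewer than `s` shingles the sketch holds every hash, so the estimate coincides with the exact value barring hash collisions.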
    1,671 Citations

    Identifying and Filtering Near-Duplicate Documents
    • 391
    Comparison of Standard and Zipf-Based Document Retrieval Heuristics
    Estimating set intersection using small samples
    • 6
    The Similarity Index
    • 5
    Detecting Short Passages of Similar Text in Large Document Collections
    • 174
    Approximate Structural Consistency
    • 1
    Syntactic similarity of Web documents
    • Álvaro R. Pereira, N. Ziviani
    • Computer Science
    • Proceedings of the First Latin American Web Congress (LA-WEB 2003)
    • 2003
    • 18
    A Scalable System for Identifying Co-derivative Documents
    • 87
    The power of two min-hashes for similarity search among hierarchical data objects
    • 11
