Finding Similar Files in a Large File System

Author: Udi Manber
Venue: USENIX Winter
We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500 MB to 1 GB per hour. The amount of…
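To make the notion of "common pieces" concrete, here is a minimal sketch of one way such a similarity could be measured. It is not taken from the abstract and should not be read as sif's actual algorithm: the idea of hashing every fixed-width byte window, the window width, the `keep_mod` sampling parameter, and the function names `fingerprints` and `containment` are all assumptions made for illustration.

```python
import hashlib


def fingerprints(data: bytes, width: int = 8, keep_mod: int = 1) -> set[int]:
    """Hash every `width`-byte window of `data` (hypothetical helper).

    Only hashes divisible by `keep_mod` are kept, which is an assumed way
    of sampling a subset of windows; keep_mod=1 keeps all of them.
    """
    kept = set()
    for i in range(len(data) - width + 1):
        digest = hashlib.blake2b(data[i:i + width], digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        if value % keep_mod == 0:
            kept.add(value)
    return kept


def containment(a: bytes, b: bytes, width: int = 8, keep_mod: int = 1) -> float:
    """Fraction of a's fingerprints that also occur in b (0.0 if a has none)."""
    fa = fingerprints(a, width, keep_mod)
    fb = fingerprints(b, width, keep_mod)
    return len(fa & fb) / len(fa) if fa else 0.0


if __name__ == "__main__":
    small = b"the quick brown fox jumps over the lazy dog"
    large = b"PREFIX " + small + b" SUFFIX"
    # `small` is fully contained in `large`, but not the other way round.
    print(containment(small, large), containment(large, small))
```

The measure is deliberately asymmetric: a file embedded inside a much larger one scores high in one direction even though the two files differ greatly overall, which matches the abstract's example of one file being contained in another. Under this sketch, a threshold such as 0.25 would correspond to the "25% similarity" figure only loosely.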
Semantic Scholar estimates that this publication has 645 citations based on the available data.