Finding Similar Files in a Large File System

Author: Udi Manber
Venue: USENIX Winter
We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500 MB to 1 GB per hour.
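The notion of similarity described above (a significant share of one file's pieces also occurring in another) can be illustrated with a small sketch. This is not sif itself: the abstract does not say how pieces are chosen or compared, so the sketch assumes that a "piece" is any fixed-length byte substring and that pieces are compared through hashed fingerprints. The names `fingerprints`, `containment`, and `similar_pairs`, and the `piece_len` parameter, are hypothetical; only the 25% threshold comes from the text.

```python
import hashlib


def fingerprints(data: bytes, piece_len: int = 8) -> set[bytes]:
    """Hash every piece_len-byte substring of data into a short fingerprint.

    Assumption: a "piece" is any contiguous substring of fixed length.
    """
    return {
        hashlib.md5(data[i:i + piece_len]).digest()[:8]
        for i in range(len(data) - piece_len + 1)
    }


def containment(a: bytes, b: bytes, piece_len: int = 8) -> float:
    """Fraction of a's distinct pieces that also occur in b (0.0 if a is too short)."""
    fa = fingerprints(a, piece_len)
    fb = fingerprints(b, piece_len)
    return len(fa & fb) / len(fa) if fa else 0.0


def similar_pairs(files: dict[str, bytes], threshold: float = 0.25,
                  piece_len: int = 8) -> list[tuple[str, str, float]]:
    """Return (x, y, score) for every ordered pair where at least
    `threshold` of x's pieces also appear in y."""
    prints = {name: fingerprints(data, piece_len) for name, data in files.items()}
    result = []
    for x, fx in prints.items():
        if not fx:
            continue  # file shorter than one piece: nothing to compare
        for y, fy in prints.items():
            if x == y:
                continue
            score = len(fx & fy) / len(fx)
            if score >= threshold:
                result.append((x, y, score))
    return sorted(result)


if __name__ == "__main__":
    sample = {
        "a": b"the quick brown fox jumps over the lazy dog",
        "b": b"intro text. the quick brown fox jumps over the lazy dog. trailing text",
        "c": b"completely unrelated content goes here 1234567890",
    }
    for x, y, score in similar_pairs(sample):
        print(f"{x} -> {y}: {score:.2f}")
```

The measure is deliberately asymmetric, so a small file embedded in a much larger one still scores highly in one direction. Comparing all pairs, as done here, is quadratic in the number of files and would not reach the throughput quoted above; it is only meant to make the definition concrete.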

