Improved robustness of signature-based near-replica detection via lexicon randomization

Aleksander Kolcz, Abdur Chowdhury, Joshua Alspector
Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with…
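
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy lexicon, the sample documents, and the helper names `cosine_similarity` and `fingerprint` are all assumptions made for illustration, and the fingerprint is only an I-Match-style simplification (a hash of the document terms that survive a lexicon filter). The sketch shows one way such a fingerprint could be brittle: a single added word changes it completely, while the cosine measure barely moves.

```python
import hashlib
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine measure between the term-frequency vectors of two documents."""
    tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fingerprint(doc, lexicon):
    """Illustrative I-Match-style signature (an assumption, not the paper's code):
    hash the sorted set of document terms that also occur in the lexicon."""
    kept = sorted(set(doc.split()) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


# Hypothetical lexicon and documents, chosen only for the demonstration.
lexicon = {"cheap", "loans", "apply", "today", "offer"}
doc_a = "apply today for cheap loans"
doc_b = "apply today for cheap loans offer"  # one extra lexicon word

sim = cosine_similarity(doc_a, doc_b)  # stays high, about 0.91
same = fingerprint(doc_a, lexicon) == fingerprint(doc_b, lexicon)  # False

print(f"cosine similarity: {sim:.3f}")
print(f"fingerprints match: {same}")
```

The fingerprint costs one hash per document and allows duplicates to be found by exact lookup, whereas the cosine measure needs a pairwise comparison. That is the computational trade-off the abstract describes.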
Semantic Scholar estimates that this paper has 94 citations.