Near-duplicate detection by instance-level constrained clustering

Grace Hui Yang and James P. Callan
For the task of near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained…
This paper has 107 citations.


Publications citing this paper.

Algorithm Theory – SWAT 2014

Lecture Notes in Computer Science • 2014

Efficient and Scalable Privacy-Preserving Similar Document Detection

GLOBECOM 2017 - 2017 IEEE Global Communications Conference • 2017

Multiple Parenting Phylogeny Relationships in Digital Images

IEEE Transactions on Information Forensics and Security • 2016


Publications referenced by this paper.

Kappa Statistic is not Satisfactory for Assessing the Extent of Agreement between Raters

K. Gwet
Statistical Methods for Inter-rater Reliability Assessment • 2002

Syntactic Clustering of the Web


Copy Detection Mechanisms for Digital Documents

SIGMOD Conference • 1995
