Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages

@inproceedings{Wei2012ParallelizedND,
  title={Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages},
  author={Yongzhuang Wei and Shuai Wang and Chunfeng Yuan and Yihua Huang},
  booktitle={2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies},
  year={2012},
  pages={523--528}
}
The large number of duplicate and near-duplicate web pages on the Internet creates many problems for search engines. Currently, no single duplicate or near-duplicate web document detection algorithm achieves both good performance and good accuracy. Moreover, most of them are designed to process English documents and cannot be used for Chinese documents…