Web documents that are partially or completely duplicated in content are easy to find on the Internet these days. Not only do these documents create redundant information on the Web, which makes unique information slower to filter and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we ...
DETECTING SIMILAR HTML DOCUMENTS USING A SENTENCE-BASED COPY DETECTION APPROACH. Rajiv Yerra, Department of Computer Science, Master of Science. Web documents that are partially or completely duplicated in content are easy to find on the Internet these days. Not only do these documents create redundant information on the Web, which makes it take longer to filter ...
Similar Web pages are easy to find on the Internet. This redundancy of information severely slows down Internet applications such as the crawl module of a search engine, and can waste storage during indexing. In this paper, we propose a content-based approach for detecting duplicated Web pages. The algorithm contains three parts: i) ...