Heritrix

Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive, is available under a free software license, and is written in Java. The…

Papers overview

Semantic Scholar uses AI to extract papers important to this topic.

2020

Introduction: The web is fraught with contradiction. On the one hand, the web has become a central means of communication in…

2016

With the rapid development of the Internet, the amount of data on the Internet has grown enormously, and the website…

2013

2012

Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document…

2011

Content on the web is increasing exponentially as the rapid development of Internet applications and services continues…

2011

When developing object-oriented classes, it is difficult to determine how to best reallocate the members of large, complex…

2011

Topic relevance of pages and hyperlinks is the key issue in focused crawling. In this paper, an improved topic relevance…

2011

This paper first introduces the web crawler in a search engine. Based on a detailed analysis of the system architecture…

2011

This paper presents the main part of a project conducted at the University of Warwick regarding a tool for retrieving semantic…

2009

Online transactions have become a main mode of e-commerce. Information discovery and price discovery in e-commerce are…