Semantic Scholar
Heritrix
Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive. It is released under a free software license and written in Java. The…
Wikipedia
Related topics
13 relations
CiteSeerX
Free software license
Java
List of Web archiving initiatives
Papers overview
Semantic Scholar uses AI to extract papers important to this topic.
2020
Behind the Scenes of Web Archiving: Metadata of Harvested Websites
2020
Corpus ID: 198351449
Introduction The web is fraught with contradiction. On the one hand, the web has become a central means of communication in…
2016
The Design and Implementation of a High-Efficiency Distributed Web Crawler
Qiumei Pu
IEEE 14th Intl Conf on Dependable, Autonomic and…
2016
Corpus ID: 15138752
With the rapid development of the Internet, the amount of data on the Internet has become increasingly large, and the website…
2013
SAAD, a content based Web Spam Analyzer and Detector
Víctor M. Prieto, M. Álvarez, Fidel Cacheda
Journal of Systems and Software
2013
Corpus ID: 37292356
2012
Web crawler middleware for search engine digital libraries: a case study for citeseerX
Jian Wu, Pradeep B. Teregowda, +6 authors, C. Lee Giles
ACM International Workshop on Web Information and…
2012
Corpus ID: 18513666
Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document…
2011
A Vertical Search Engine for School Information Based on Heritrix and Lucene
Hyo-Bong Lee, F. Nazareno, Seung-Hyun Jung, W. Cho
International Conference on Hybrid Information…
2011
Corpus ID: 46063274
The contents on the web are increasing exponentially as the rapid development of the Internet applications and services continues…
2011
Visualizing the Refactoring of Classes via Clustering
K. Cassell, C. Anslow, L. Groves, Peter M. Andreae
Australasian Computer Science Conference
2011
Corpus ID: 18756125
When developing object-oriented classes, it is difficult to determine how to best reallocate the members of large, complex…
2011
An improved topic relevance algorithm for focused crawling
Hongwei Hao, Cui-Xia Mu, Xu-Cheng Yin, Shen Li, Zhi-Bin Wang
IEEE International Conference on Systems, Man and…
2011
Corpus ID: 36558401
Topic relevance of pages and hyperlinks is the key issue in focused crawling. In this paper, an improved topic relevance…
2011
Study and Application of Web Crawler Algorithm Based on Heritrix
D. Liu, Xianzhi Fan
2011
Corpus ID: 60509489
This paper first introduces the web crawler used in search engines. Based on a detailed analysis of the system architecture…
2011
Wikicrawl: reusing semantic web data in authoring Wikipedia
M. Yau, A. Cristea
2011
Corpus ID: 12208957
This paper presents the main part of a project conducted at the University of Warwick regarding a tool for retrieving semantic…
2009
A study of online transaction platform based on interactive search engine
Qifang Li, Ting Yang
16th International Conference on Industrial…
2009
Corpus ID: 1198798
Online transactions have become a main mode of e-commerce at present. Information discovery and price discovery in e-commerce are…