What's there and what's not?: focused crawling for missing documents in digital libraries
@article{Zhuang2005WhatsTA, title={What's there and what's not?: focused crawling for missing documents in digital libraries}, author={Ziming Zhuang and Rohit Wagle and C. Lee Giles}, journal={Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)}, year={2005}, pages={301-310} }
Some large scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonable size libraries, they can suffer from having only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection…
55 Citations
An Approach for Focused Crawler to Harvest Digital Academic Documents in Online Digital Libraries
- Computer Science
- Int. J. Inf. Retr. Res.
- 2019
With the rapid growth of digital information and user needs, £1.5bn-worth of assets are expected to be created within the next 12 months.
Finding what is missing from a digital library: A case study in the Computer Science field
- Computer Science
- Inf. Process. Manag.
- 2009
Focused Crawling : A Means to Acquire Biological Data from the Web
- Computer Science
- 2007
The main features of focused crawling are described, the author's research group's work on focused crawling is discussed, and problem areas of focused crawling not covered in the literature are examined.
Effective Concentrated Web Crawling Approach Path for Google
- Computer Science
- 2017
A focused crawler that, in addition to calculating the absolute frequency of the topic keyword, also counts the keyword's equivalent words and sub-equivalent words, and constructs the weight table according to the user query.
Author Homepage Discovery in CiteSeerX
- Computer Science
- AAAI
- 2021
This work proposes a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library including CiteSeerX to crawl scientific documents.
Efficient Focused Web Crawling Approach for Search Engine
- Computer Science
- 2015
A focused crawler traverses the web, selecting pages relevant to a predefined topic and ignoring those that are not; in addition to calculating the frequency of the topic keyword, it also counts the keyword's synonyms and sub-synonyms.
On the Use of Web Search to Improve Scientific Collections
- Computer Science
- SDP
- 2020
This paper proposes a novel search-driven framework for acquiring documents for scientific portals using publicly-available research paper titles and author names used as queries to a Web search engine.
A Review of Focused Web Crawling Strategies
- Computer Science
- 2012
This paper reviews research on several focused web crawling strategies and proposes a new technique that assigns credits to web pages according to their semantic content and prioritizes the frontier queue so that URLs of higher-credit pages are crawled before those of lower-credit ones.
Effects of Start URLs in Focused Web Crawling
- Computer Science
- 2009
The results showed that all regions considered in this study are good starting points for focused crawling in the domains of genetics and genomics since each of them yielded a high coverage.
References
Panorama: extending digital libraries with topical crawlers
- Computer Science
- Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004.
- 2004
This work proposes one such technique that uses a topical crawler driven by the information extracted from a research document to harvest a collection of Web pages that are focused on the topical subspaces associated with the given document.
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
- Computer Science
- Comput. Networks
- 1999
PaSE: Locating Online Copy of Scientific Documents Effectively
- Computer Science
- ICADL
- 2004
This paper presents a system, named PaSE, which can effectively locate online copies (e.g., PDF or PS) of scientific documents using citation information, and shows that PaSE can locate online copies of documents more accurately and conveniently than human users would, at the cost of a longer search time.
Intelligent crawling on the World Wide Web with arbitrary predicates
- Computer Science
- WWW '01
- 2001
This paper proposes the novel concept of intelligent crawling which actually learns characteristics of the linkage structure of the world wide web while performing the crawling, and refers to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure.
Finding scientific papers with homepagesearch and MOPS
- Computer Science
- SIGDOC '01
- 2001
This paper describes a new approach to seek scientific papers relevant to a pre-defined research area, which is very effective for building high-quality collections and indices of scientific papers, using ordinary desktop hardware.
Focused Crawling Using Context Graphs
- Computer Science
- VLDB
- 2000
A focused crawling algorithm is presented that builds a model for the context within which topically relevant pages occur on the web. The model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently co-occur with relevant pages.
On Learning Strategies for Topic Specific Web Crawling
- Computer Science
- 2004
Some recent techniques for crawling web pages belonging to specific topics are discussed, along with some creative ways of combining different kinds of linkage and user-centered methods to improve the effectiveness of the crawl.
Dynamic Reference Sifting: A Case Study in the Homepage Domain
- Computer Science
- Comput. Networks
- 1997
Topical web crawlers: Evaluating adaptive algorithms
- Computer Science
- TOIT
- 2004
A framework to fairly evaluate topical crawling algorithms under a number of performance metrics is developed and a novel combination of explorative and exploitative bias is found, and an evolutionary crawler is introduced that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls.
Using Metadata to Enhance a Web Information Gathering System
- Computer Science
- WebDB
- 2000
This paper shows how the system uses annotations about the hyperlinks contained in web pages to guide its crawl of the web, and how a repository of link information that includes annotations is used to build quality metadata.