What's there and what's not?: focused crawling for missing documents in digital libraries

@article{Zhuang2005WhatsTA,
  title={What's there and what's not?: focused crawling for missing documents in digital libraries},
  author={Ziming Zhuang and Rohit Wagle and C. Lee Giles},
  journal={Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05)},
  year={2005},
  pages={301--310}
}
Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection…
An Approach for Focused Crawler to Harvest Digital Academic Documents in Online Digital Libraries
TLDR
With the rapid growth of digital information and user needs, £1.5bn-worth of assets are expected to be created within the next 12 months.
Focused Crawling : A Means to Acquire Biological Data from the Web
TLDR
The main features of focused crawling are described, the research on focused crawling conducted by the research group of the author is discussed, and the problem areas associated with focused crawling not discussed in the literature are discussed.
Effective Concentrated Web Crawling Approach Path for Google
TLDR
A focused crawler that, in addition to calculating the absolute frequency of the topic keyword, also calculates the frequency of the keyword's equivalent and sub-equivalent words; the weight table is constructed according to the user query.
Author Homepage Discovery in CiteSeerX
TLDR
This work proposes a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library including CiteSeerX to crawl scientific documents.
Efficient Focused Web Crawling Approach for Search Engine
TLDR
A focused crawler traverses the web, selecting pages relevant to a predefined topic and neglecting those that are not; in addition to calculating the frequency of the topic keyword, it also calculates the frequency of the keyword's synonyms and sub-synonyms.
On the Use of Web Search to Improve Scientific Collections
TLDR
This paper proposes a novel search-driven framework for acquiring documents for scientific portals using publicly-available research paper titles and author names used as queries to a Web search engine.
A Review of Focused Web Crawling Strategies
TLDR
This paper reviews research on several focused web crawling strategies and proposes a new technique that assigns credits to web pages according to their semantic content and prioritizes the frontier queue so that URLs of higher-credit pages are crawled before those of lower-credit ones.
Effects of Start URLs in Focused Web Crawling
TLDR
The results showed that all regions considered in this study are good starting points for focused crawling in the domains of genetics and genomics since each of them yielded a high coverage.

References

Panorama: extending digital libraries with topical crawlers
TLDR
This work proposes one such technique that uses a topical crawler driven by the information extracted from a research document to harvest a collection of Web pages that are focused on the topical subspaces associated with the given document.
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
PaSE: Locating Online Copy of Scientific Documents Effectively
TLDR
This paper presents a system, named as PaSE, which can effectively locate online copies (e.g., PDF or PS) of scientific documents using citation information and shows that PaSE can locate online copy of documents more accurately and conveniently than human users would do at the cost of elongated search time.
Intelligent crawling on the World Wide Web with arbitrary predicates
TLDR
This paper proposes the novel concept of intelligent crawling which actually learns characteristics of the linkage structure of the world wide web while performing the crawling, and refers to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure.
Finding scientific papers with homepagesearch and MOPS
TLDR
This paper describes a new approach to seek scientific papers relevant to a pre-defined research area, which is very effective for building high-quality collections and indices of scientific papers, using ordinary desktop hardware.
Focused Crawling Using Context Graphs
TLDR
A focused crawling algorithm is presented that builds a model for the context within which topically relevant pages occur on the web that can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently cooccur with relevant pages.
On Learning Strategies for Topic Specific Web Crawling
TLDR
Some recent techniques for crawling web pages belonging to specific topics are discussed and some creative ways of combining different kinds of linkage and user-centered methods in order to improve the effectiveness of the crawl are discussed.
Dynamic Reference Sifting: A Case Study in the Homepage Domain
Topical web crawlers: Evaluating adaptive algorithms
TLDR
A framework to fairly evaluate topical crawling algorithms under a number of performance metrics is developed and a novel combination of explorative and exploitative bias is found, and an evolutionary crawler is introduced that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls.
Using Metadata to Enhance a Web Information Gathering System
TLDR
This paper shows how the system uses annotations about the hyperlinks contained in web pages to guide itself to crawl the web, and how a repository of link information that includes annotations is used to build quality metadata.