Search engine coverage of the OAI-PMH corpus

@article{McCown2006SearchEC,
  title={Search engine coverage of the OAI-PMH corpus},
  author={Frank McCown and Xiaoming Liu and Michael L. Nelson and Mohammad Zubair},
  journal={IEEE Internet Computing},
  year={2006},
  volume={10},
  pages={66--73}
}
Having indexed much of the "surface" Web, search engines are now using various approaches to index the "deep" Web. At the same time, institutional repositories and digital libraries are adopting the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to expose their holdings. The authors harvested nearly 10 million records from OAI-PMH repositories. From these records, they extracted 3.3 million unique resource URLs and then conducted searches on samples from this collection to…
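The harvesting step described above maps onto the protocol's ListRecords verb, with resumption tokens for paging. The following is a minimal sketch in Python, assuming Dublin Core metadata and treating dc:identifier values that start with "http" as resource URLs; the filtering rule is an illustrative assumption, not the authors' actual harvester.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def harvest_identifiers(base_url):
    """Yield dc:identifier values from every record, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        for ident in root.iter(DC_NS + "identifier"):
            if ident.text and ident.text.startswith("http"):
                yield ident.text  # keep only URL-shaped identifiers
        token = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # an empty or absent token ends the harvest
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Example against a real public endpoint (arXiv's OAI-PMH interface):
# unique_urls = set(harvest_identifiers("http://export.arxiv.org/oai2"))

Deduplicating the yielded identifiers into a set mirrors the paper's reduction of nearly 10 million records to 3.3 million unique resource URLs.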
The deep web in institutional repositories in Japan
TLDR
The extent of the deep web in Japan is calculated from the content of searchable institutional repositories (IRs) in Japan, but using a more appropriate sampling interval and an exhaustive search with three major search engines.
Agreeing to disagree: search engines and their public interfaces
TLDR
This work provides the first in-depth quantitative analysis of the results produced by the API and WUI interfaces of Google, MSN, and Yahoo, and finds MSN to produce the most consistent results between its two interfaces.
Indexing the web
TLDR
The most important role is played by the automatic web indexing mechanisms search engines use: the PageRank mechanism sorts the retrieved web pages according to the links they receive from other sites.
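As a toy illustration of the link-based ranking idea in that summary, here is a power-iteration PageRank in Python; the damping factor, iteration count, and three-page graph are assumptions for the example, not anything from the paper.

def pagerank(links, damping=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# "A" is linked to by both other pages, so it receives the highest rank:
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))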
Global picture of OAI-PMH repositories through the analysis of 6 key open archive meta-catalogs
TLDR
This article compares the data common to each meta-catalog, estimates which repositories are found within them, and identifies the need to collate this data and improve current search tools, illustrating the benefits of a single, comprehensive unifying meta-catalog for end users.
Designing New Crawling and Indexing Techniques for Web Search Engines
TLDR
This thesis studies how a Web search engine's crawler, with limited computing resources, can effectively crawl the dynamically changing Web and acquire the most up-to-date documents, and how a search engine can provide information-object-oriented indexing methods that enable users to retrieve desired information with high accuracy and efficiency.
Search Engines as an Effective Tool for Library Professionals
TLDR
The study examines various aspects of search engines, including their background and how they work, and analyses internet search techniques: basic, advanced, and refined search.
The Pacific Rim Library: A Surprising Pearl
TLDR
The Pacific Rim Library holds over 300,000 records harvested from OAI data provider libraries around the Pacific; by mirroring their metadata, PRL increases the chance that the data will be discovered in Google and other general search engines.
Abstract The Pacific Rim Library (PRL) is an initiative of the Pacific Rim Digital Library Association (PRDLA). The project began in 2006 using the OAI-PMH paradigm and now holds over 300,000 records…
Extracting and Ingesting DDI Metadata and Digital Objects from a Data Archive into the iRODS Extension of the NARA TPAP Using the OAI-PMH
This prototype demonstrated that the migration of collections between digital libraries and preservation data archives is now possible using automated batch load for both data and metadata. We used…
Getting Indexed by Bibliographic Databases in the Area of Computer Science
TLDR
This paper sheds light on the various data formats, protocols, and technical requirements for getting indexed by widely used bibliographic databases in the area of computer science, and provides hints for maximal database inclusion.

References

SHOWING 1-10 OF 30 REFERENCES
DP9: an OAI gateway service for web crawlers
Many libraries and databases are closed to general-purpose Web crawlers, and they expose their content only through their own search engines. At the same time many researchers attempt to locate…
White Paper: The Deep Web: Surfacing Hidden Value
TLDR
BrightPlanet's search technology automates the process of making dozens of direct queries simultaneously using multiple-thread technology and thus is the only search technology, so far, that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.
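The "dozens of direct queries simultaneously" idea amounts to fanning one query out to many deep-web search endpoints in parallel. A minimal sketch, assuming hypothetical endpoints and a plain q= query parameter (BrightPlanet's actual technology is proprietary, so none of this is their API):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def direct_query(endpoint, term):
    """Send one query to one deep-web search form and return the raw response."""
    with urllib.request.urlopen(f"{endpoint}?q={term}") as resp:
        return endpoint, resp.read()

# Hypothetical searchable databases, queried concurrently:
endpoints = [f"https://db{i}.example.org/search" for i in range(24)]
with ThreadPoolExecutor(max_workers=12) as pool:
    results = list(pool.map(lambda e: direct_query(e, "oai-pmh"), endpoints))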
Downloading textual hidden web content through keyword queries
TLDR
This paper provides a theoretical framework to investigate the query generation problem for the hidden Web, proposes effective policies for generating queries automatically, and experimentally evaluates the effectiveness of these policies on four real hidden-Web sites.
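The greedy flavor of such policies can be sketched as follows: issue a seed query, then repeatedly pick the candidate term expected to surface the most new documents, estimating each term's yield from the documents downloaded so far. This reduces the paper's cost/benefit model to bare document frequency, so treat it as a simplification, not the proposed policy itself.

from collections import Counter

def greedy_harvest(search, seed_term, budget):
    """search(term) returns a list of (doc_id, text); budget caps total queries."""
    downloaded = dict(search(seed_term))   # doc_id -> text
    issued = {seed_term}
    for _ in range(budget - 1):
        # Document frequency of each term in the corpus harvested so far.
        df = Counter()
        for text in downloaded.values():
            df.update(set(text.lower().split()))
        candidates = [t for t in df if t not in issued]
        if not candidates:
            break
        best = max(candidates, key=df.__getitem__)  # highest estimated yield
        issued.add(best)
        for doc_id, text in search(best):
            downloaded.setdefault(doc_id, text)
    return downloaded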
Extracting structured data from Web pages
TLDR
This paper presents an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages.
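To make the template-deduction idea concrete: if pages are generated by plugging values into a fixed template, token positions where all pages agree belong to the template, and positions where they differ are data slots. The toy below assumes equal-length token streams, which real algorithms in this line of work do not require.

def extract_values(pages):
    """Split template-generated pages into per-page lists of slot values."""
    streams = [p.split() for p in pages]
    assert len({len(s) for s in streams}) == 1, "sketch needs aligned pages"
    values = [[] for _ in pages]
    for column in zip(*streams):
        if len(set(column)) > 1:        # tokens differ here -> a data slot
            for row, token in enumerate(column):
                values[row].append(token)
    return values

print(extract_values([
    "<b> Title : </b> Alpha <i> Price : </i> 10",
    "<b> Title : </b> Beta <i> Price : </i> 12",
]))
# -> [['Alpha', '10'], ['Beta', '12']]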
Crawling the Hidden Web
TLDR
A generic operational model of a hidden Web crawler is introduced, and how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford, is described.
The Deep Web : Surfacing Hidden Value
Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines cannot…
mod_oai: An Apache Module for Metadata Harvesting
We describe mod_oai, an Apache 2.0 module that implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH is the de facto standard for metadata exchange in digital…
Open Archives Initiative Protocol for Metadata Harvesting
TLDR
This paper is an introduction to the OAI Protocol for Metadata Harvesting, outlining the main technical ideas of OAI-PMH and how to implement the protocol.
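For orientation, the protocol itself is small: six verbs carried over plain HTTP GET, with responses in XML. The endpoint below is a hypothetical placeholder.

from urllib.parse import urlencode

BASE = "https://repository.example.org/oai"   # hypothetical endpoint

# The six OAI-PMH verbs:
for verb in ["Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"]:
    print(f"{BASE}?verb={verb}")

# A harvest adds arguments, e.g. the request form used throughout this page:
print(BASE + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"}))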
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
TLDR
A novel technique is developed to compare HTML pages and generate a wrapper based on their similarities and differences; experiments confirm the feasibility of the approach on real-life data-intensive Web sites.
The indexable web is more than 11.5 billion pages
TLDR
The size of the public indexable web is estimated at 11.5 billion pages, and the overlap and index sizes of Google, MSN, Ask/Teoma, and Yahoo! are estimated.
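Overlap-based size estimates of this kind are essentially capture-recapture (Lincoln-Petersen): sample random pages from one engine, check what fraction a second engine also indexes, and scale accordingly. The numbers below are made up for illustration and are not the paper's data.

def lincoln_petersen(size_a, sampled_from_b, also_in_a):
    """Estimate the total population from two overlapping 'captures'."""
    coverage = also_in_a / sampled_from_b   # estimates P(page is in A)
    return size_a / coverage                # |A| / P(in A) ~ population size

# Engine A claims 8e9 pages; of 1,000 random pages from engine B, 420 are in A:
print(f"{lincoln_petersen(8e9, 1000, 420):.2e}")   # ~1.90e+10 pages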