Information retrieval on the web

  • Mei Kobayashi, Koichi Takeda
  • ACM Comput. Surv.
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited by the sources are consistent and point to exponential growth in the past and in the coming decade. Hence it is not surprising that about 85% of Internet users… 

Intelligent Web Agent for Search Engines

This paper illustrates the different types of agents, crawlers, robots, etc., for mining the contents of the Web in a methodical, automated manner, and discusses the use of a crawler to gather specific types of information from Web pages, such as harvesting e-mail addresses.


The Web Information Retrieval paradigm is expounded by illustrating its basics, the components, model categories, tools, tasks and the performance measures that quantify the quality of retrieval results.


A Normalized Google Distance (NGD) algorithm that uses Google as a semantic corpus is introduced; it can provide a new perspective for IR research and extract the most important keywords or keyword sequences for advanced knowledge discovery.
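
As a rough illustration of the distance named above, the following minimal sketch computes NGD from page counts, assuming the commonly cited definition based on logarithms of hit counts. The function name and the counts passed to it are hypothetical; nothing here queries a real search engine, and the paper's own algorithm may differ in detail.

```python
import math

def normalized_google_distance(fx, fy, fxy, n):
    """Sketch of NGD from hit counts (all inputs are assumed, not fetched).

    fx, fy : number of pages containing term x, term y
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (normalizing constant)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur (or never co-occur) are treated as maximally distant.
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

Terms that always co-occur get a distance of 0, and the value grows as the two terms appear together less often relative to how often each appears alone.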

Enhancing the Power of the Internet

The design of any new intelligent search engine should be based on at least two main motivations: the web environment is, for the most part, unstructured and imprecise, and a logic that supports approximate rather than exact modes of reasoning is needed.

Information discovery and retrieval tools

  • M. Frame
  • Computer Science
    Inf. Serv. Use
  • 2004
This session will focus on the various Internet search engines, directories, and how to improve the user experience through the use of such techniques as metadata, meta-search engines, subject specific search tools, and other developing technologies.

Information retrieval on the web

  • Kiduk Yang
  • Computer Science
    Annu. Rev. Inf. Sci. Technol.
  • 2005
Researchers in Web IR have reexamined the findings from traditional IR research to discover which conventional evaluation measures may no longer be appropriate for Web IR, where a representative test collection is all but impossible to construct.

Enhanced Web document retrieval using automatic query expansion

This work describes a scheme that attempts to remedy the situation by automatically expanding the user query through the analysis of initially retrieved documents, and experimental results to demonstrate the effectiveness of the query expansion scheme are presented.
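
The summary above does not say how expansion terms are chosen, so the following is only a generic sketch of expansion from initially retrieved documents (often called pseudo-relevance feedback), not the scheme from that work. The function name, the stopword list, and the frequency-based term selection are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the source).
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for"})

def expand_query(query, retrieved_docs, num_terms=3):
    """Append the most frequent new terms from the top retrieved documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for doc in retrieved_docs:
        for token in re.findall(r"[a-z0-9]+", doc.lower()):
            # Skip stopwords and terms the user already supplied.
            if token not in STOPWORDS and token not in query_terms:
                counts[token] += 1
    extra_terms = [term for term, _ in counts.most_common(num_terms)]
    return query_terms + extra_terms
```

A real system would typically weight candidate terms (for example by how specific they are to the retrieved set) rather than by raw frequency, but the control flow is the same: retrieve, analyze the top documents, and reissue the enlarged query.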

Web Information Retrieval

This paper takes a deeper dive into the Web IR process, a variant of classical Information Retrieval, by clearly explaining its core concepts, the components, model categories, tools, tasks and the performance measures that quantify the quality of retrieval results.


The study shows that a Web search engine's popularity and ability to retrieve a large number of documents do not mean that it has high precision in retrieving relevant documents for its users, and that Africa has contributed little or nothing to global Web content in the field of Computer Science.

Effective Retrieval of Information in Tables on the Internet

Based on similarity to a Web HTML document, the main purpose here is to parse tables and construct a dictionary of table indexes for use in an information retrieval system, thereby improving its accuracy.



Searching the Web: general and scientific information access

  • S. Lawrence, C. Lee Giles
  • Computer Science
    First IEEE/POPOV Workshop on Internet Technologies and Services. Proceedings (Cat. No.99EX391)
  • 1999
The World Wide Web has revolutionized the way that people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination

Learning Information Retrieval Agents: Experiments with Automated Web Browsing

A system that helps users keep abreast of new and interesting information. Every day it presents a selection of interesting web pages; the user evaluates each page, and given this feedback the system adapts and attempts to produce better pages the following day.

Text and Image Metasearch on the Web

Both the text and image metasearch functions of Inquirus are surprisingly fast, and the parallel architecture of the engine that provides this efficiency is described.

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Querying multiple document collections across the internet

GlOSS, a scalable system that chooses the best document sources for a query, is designed, and dSCAM, an "illegal copy" metasearcher that finds potential copies of a document over distributed text sources, is developed.

Human Performance on Clustering Web Pages: A Preliminary Study

An initial study of human clustering of web pages, in the hope that it would provide some insight into the difficulty of automating this task, shows that subjects did not cluster identically; in fact, any two subjects had little similarity in their web-page clusters.

An Adaptive Agent for Automated Web Browsing

A system that learns to browse the Internet on behalf of a user: every day it presents a selection of interesting Web pages, the user evaluates each page, and given this feedback the system adapts and attempts to produce better pages the following day.

Multi-Service Search and Comparison Using the MetaCrawler

The MetaCrawler provides a single, central interface for Web document searching that facilitates customization, privacy, sophisticated filtering of references, and more, and serves as a tool for comparison of diverse search services.

Searching the world wide Web

The coverage and recency of the major World Wide Web search engines were analyzed, yielding some surprising results, including a lower bound on the size of the indexable Web of 320 million pages.