An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections

@article{Shekhar2010AnAF,
  title={An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections},
  author={Shashi Shekhar and Rohit Agrawal and Karm Veer Arya},
  journal={2010 International Conference on Advances in Computer Engineering},
  year={2010},
  pages={29-33}
}
As the Web continues to grow, searching for relevant information with traditional search engines has become difficult. Many index-based web search engines exist for searching information in various domains on the Web, but the documents (URLs) they retrieve for a searched topic are often of poor quality. Moreover, since the number of Web pages is growing rapidly, devising a personalized Web search is of great importance. This paper proposes a…
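
The full proposal is truncated above; as background for the theme named in the title, here is a minimal sketch of one standard way to filter replicated web collections: near-duplicate detection with word shingles and Jaccard similarity. The paper's actual filtering method is not shown on this page, and the helper names and the 0.9 threshold are illustrative assumptions.

def shingles(text, k=5):
    """Split a document into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def filter_replicas(docs, threshold=0.9):
    """Keep one representative URL per group of near-duplicate documents."""
    kept = []  # (url, shingle set) pairs for documents retained so far
    for url, text in docs.items():
        sh = shingles(text)
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((url, sh))
    return [url for url, _ in kept]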

A WEBIR Crawling Framework for Retrieving Highly Relevant Web Documents: Evaluation Based on Rank Aggregation and Result Merging Algorithms

The proposed Content-Based Result Aggregation (CBRA) algorithm outperforms existing content-based merging algorithms that use the full document content, and even simple result-merging strategies can outperform Google, Yahoo, and MSN Live.
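
CBRA's details are not given on this page; as an illustration of the simpler end of result merging, here is a sketch of Borda-count rank aggregation over per-engine result lists. The function name and example lists are hypothetical, not the paper's algorithm.

from collections import defaultdict

def borda_merge(ranked_lists):
    """Merge per-engine rankings: the URL ranked r-th in a list of n gets n - r points."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for rank, url in enumerate(ranking):
            scores[url] += n - rank
    return sorted(scores, key=scores.get, reverse=True)

# Example: three engines return overlapping ranked lists of URLs.
print(borda_merge([["a", "b", "c"], ["b", "a", "d"], ["c", "b", "a"]]))
# -> ['b', 'a', 'c', 'd']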

Clustering Retrieved Web Documents to Speed Up Web Searches

Experimental results show that QClus is effective and efficient in generating high-quality clusters of documents on specific topics with informative labels, which saves the user’s time and effort in searching for specific information of interest without having to browse through the documents one by one.
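
QClus itself is not reproduced here; the following is a minimal sketch of the general idea of clustering retrieved documents, using single-pass cosine-similarity clustering over term-frequency vectors. The threshold value and helper names are illustrative assumptions, not the paper's method.

import math
from collections import Counter

def tf(text):
    """Term-frequency vector of a document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: add each doc to the first cluster whose centroid is similar enough."""
    clusters = []  # (centroid Counter, member list) pairs
    for doc in docs:
        vec = tf(doc)
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                members.append(doc)
                centroid.update(vec)  # fold the new doc into the centroid
                break
        else:
            clusters.append((vec, [doc]))
    return [members for _, members in clusters]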

Semantic Based Image Retrieval using multi-agent model by searching and filtering replicated web images

The proposed framework overcomes the two major problems in retrieving user-centric images from the web, freshness and redundancy, and can also serve as a personalized image search engine that extracts text from the web to semantically describe the retrieved images.

Enhancing Web Search Using Query-Based Clusters and Labels

  • Rani Qumsiyeh, Yiu-Kai Ng
  • Computer Science
    2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)
  • 2013
Experimental results show that QCL is effective and efficient in generating high-quality clusters of documents on specific topics with informative labels, which saves the user's time and effort in searching for specific information of interest without having to browse through the documents one by one.

Enhancing web search by using query-based clusters and multi-document summaries

Experimental results show that QSum generates a concise yet comprehensive summary for each cluster of documents retrieved in response to a user query, which saves the user's time and effort in searching for specific information of interest without having to browse through the documents one by one.
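
QSum's method is not shown on this page; here is a minimal sketch of generic extractive multi-document summarization, scoring sentences by the average frequency of their words across the cluster. The function and the regex sentence splitter are simplifying assumptions, not the paper's algorithm.

import re
from collections import Counter

def summarize(docs, k=3):
    """Extract the k sentences whose words are most frequent across the whole cluster."""
    sentences = [s.strip() for d in docs for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())

    def score(sentence):
        words = sentence.lower().split()
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    return sorted(sentences, key=score, reverse=True)[:k]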

Study of Web Crawler and its Different Types

This paper briefly reviews the concept of the web crawler, its architecture, and its various types; crawlers are an essential method for collecting data on, and keeping up with, the rapidly expanding Internet.

Extraction System Web Content Sports New Based On Web Crawler Multi Thread

This research uses a multi-threaded approach to produce web crawlers that crawl sports news faster by fetching from more than one news-source address at a time.
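
As a rough illustration of that multi-threaded idea, here is a sketch of a concurrent fetcher built on Python's standard threading and queue modules. The worker count, timeout, and error handling are illustrative choices, not the paper's implementation.

import queue
import threading
import urllib.request

def crawl(seed_urls, n_threads=4):
    """Fetch many news-source URLs concurrently instead of one at a time."""
    todo = queue.Queue()
    for url in seed_urls:
        todo.put(url)
    pages, lock = {}, threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    body = resp.read()
                with lock:
                    pages[url] = body
            except OSError:
                pass  # skip unreachable sources in this sketch
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pages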

Firefly Optimization Algorithm Based Web Scraping for Web Citation Extraction

The primary purpose of this research is to extract author information: the extraction process collects the citation details of works published by an author, including journal name, publisher, year, and citations, using web citation analysis.
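
The cited work couples web scraping with the firefly metaheuristic; the following is a minimal sketch of the firefly algorithm's core loop on a generic objective. The population size, step parameters, and search bounds are illustrative, and how the cited work encodes the scraping task as an objective is not shown here.

import math
import random

def firefly_minimize(f, dim=2, n=15, iters=100, alpha=0.2, beta0=1.0, gamma=1.0):
    """Core firefly loop: dimmer fireflies move toward brighter (lower-objective) ones."""
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        light = [f(x) for x in pop]
        for i in range(n):
            for j in range(n):
                if light[j] < light[i]:  # firefly j is brighter, so i moves toward it
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[i], pop[j]))
                    beta = beta0 * math.exp(-gamma * r2)  # attractiveness decays with distance
                    pop[i] = [a + beta * (b - a) + alpha * (random.random() - 0.5)
                              for a, b in zip(pop[i], pop[j])]
                    light[i] = f(pop[i])
    return min(pop, key=f)

# Toy usage: minimize the sphere function.
best = firefly_minimize(lambda x: sum(v * v for v in x))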

A Performance Discussion of Web Crawler Using Some Advanced Factors

This research considers the error rate to enhance accuracy and reduce space and time during web crawling; it minimizes errors by filtering URLs on the basis of advanced parameters such as TTL, the frequency of web-page visits, and spam reported by users.
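
A minimal sketch of what filtering URLs on such parameters might look like follows; the UrlStats fields and thresholds are hypothetical stand-ins for the paper's advanced parameters.

from dataclasses import dataclass

@dataclass
class UrlStats:
    url: str
    ttl: int           # remaining time-to-live, in crawl cycles (hypothetical unit)
    visit_freq: float  # observed change frequency per visit (hypothetical measure)
    spam_reports: int  # spam reports filed by users

def filter_urls(candidates, max_spam=2, min_freq=0.1):
    """Drop expired, rarely-changing, or spam-reported URLs before crawling them."""
    return [u for u in candidates
            if u.ttl > 0 and u.visit_freq >= min_freq and u.spam_reports <= max_spam]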

Genetically optimizing query expansion for retrieving activities from the web

An overview of the system's design, which is based on semantic query expansion, is given along with a detailed explanation of how the system's parameters are optimized using genetic algorithms.
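
Here is a minimal sketch of a genetic loop for query expansion, with selection, crossover, and mutation over candidate expansion terms. The term pool is invented and fitness is a random stand-in for a real retrieval-quality measure; the cited system's encoding will differ.

import random

TERM_POOL = ["hiking", "trail", "trek", "outdoor", "mountain", "walk"]  # invented candidates

def fitness(expansion):
    """Stand-in for retrieval quality of the expanded query; replace with a real measure."""
    return random.random()

def evolve(pop_size=10, generations=20, term_count=3):
    pop = [tuple(random.sample(TERM_POOL, term_count)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            pool = list(set(a) | set(b))  # crossover: union of the parents' terms
            if random.random() < 0.3:     # mutation: swap in a fresh candidate term
                pool[random.randrange(len(pool))] = random.choice(TERM_POOL)
            children.append(tuple(random.sample(pool, min(term_count, len(pool)))))
        pop = parents + children
    return max(pop, key=fitness)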

References

Mining the Web's Link Structure

Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best sources of information on a given topic, and hubs, which provide collections of links to authorities.
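
Clever builds on hub/authority link analysis; the textbook HITS iteration underlying that idea can be sketched as follows (the adjacency-dict representation and iteration count are illustrative, and this is the generic algorithm rather than Clever itself).

def hits(links, iters=50):
    """Textbook HITS: authority = sum of in-pointing hub scores; hub = sum of out-pointed authority scores."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        auth = {p: sum(hub[q] for q, targets in links.items() if p in targets) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# links maps each page to the pages it points to, e.g.:
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})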

Intelligent crawling on the World Wide Web with arbitrary predicates

This paper proposes the novel concept of intelligent crawling, in which the crawler learns characteristics of the linkage structure of the World Wide Web while performing the crawl; the technique is called intelligent crawling because of its adaptive nature in adjusting to the web's linkage structure.
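
A minimal sketch of that adaptive flavor of crawling: a priority-queue frontier where link priorities track the crawler's running estimate of how often pages satisfy the user's predicate. Here fetch, extract_links, and predicate are caller-supplied stubs, and the scoring rule is a simplification rather than the paper's learning method.

import heapq

def focused_crawl(seed, fetch, extract_links, predicate, budget=100):
    """Crawl with a priority queue; link scores follow the observed relevance rate."""
    frontier = [(-1.0, seed)]  # max-heap via negated scores
    seen = {seed}
    relevant = []
    hits, visits = 1, 2  # smoothed running estimate of the relevance rate
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        budget -= 1
        page = fetch(url)
        ok = predicate(page)
        visits += 1
        hits += ok
        if ok:
            relevant.append(url)
        # Outlinks of relevant pages get a boost over the crawler's base estimate.
        link_score = (hits / visits) * (2.0 if ok else 1.0)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-link_score, link))
    return relevant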

Searching for Hidden-Web Databases

A new crawling strategy to automatically locate hidden-Web databases is proposed, which aims to balance two conflicting requirements of this problem: the need to perform a broad search while at the same time avoiding the need to crawl a large number of irrelevant pages.

An adaptive crawler for locating hidden-Web entry points

A new framework is proposed whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning.

Semantic Web Content Analysis: A Study in Proximity-Based Collaborative Clustering

This study proposes an approach for binding the "semantic" facet with the usual textual one that together constitute a typical web page, or specifically a semantic web document, and offers a new alternative for organizing web documents that emphasizes a direct separation between the syntactic and semantic facets of web information.

Cooperative crawling

  • M. Buzzi
  • Computer Science
    Proceedings of the First Latin American Web Congress (LA-WEB 2003)
  • 2003
This work proposes a scheme to permit a crawler to acquire information about the global state of a Website before the crawling process takes place, which requires Web server cooperation in order to collect and publish information on its content.

Design and implementation of a high-performance distributed Web crawler

This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.
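
One design decision such distributed crawlers must make is how to partition URLs across machines; here is a minimal sketch of host-based hash partitioning, a common scheme though not necessarily this paper's, with an illustrative helper name.

import hashlib
from urllib.parse import urlparse

def assign_worker(url, n_workers):
    """Hash the host so every URL from one site lands on the same worker (eases per-site politeness)."""
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % n_workers

# All URLs from the same host map to the same worker:
assert assign_worker("http://example.com/a", 8) == assign_worker("http://example.com/b", 8)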

Web mining: information and pattern discovery on the World Wide Web

This paper defines Web mining and presents an overview of the various research issues, techniques, and development efforts; briefly describes WEBMINER, a system for Web usage mining; and concludes by listing open research issues.

A scalable comparison-shopping agent for the World-Wide Web

ShopBot is a fully implemented, domain-independent comparison-shopping agent that relies on a combination of heuristic search, pattern matching, and inductive learning techniques; it enables users both to find superior prices and to substantially reduce Web shopping time.

Incorporating agent based neural network model for adaptive meta-search

This approach uses an adaptive agent-based neural network model to improve the quality of search results by incorporating user relevance feedback into the system.
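
A minimal sketch of that relevance-feedback idea, with a linear model standing in for the paper's neural network; the function names, scores, and learning rate are illustrative assumptions.

def rerank(engine_scores, weights):
    """Order URLs by the weighted sum of their per-engine scores."""
    return sorted(engine_scores,
                  key=lambda u: sum(w * s for w, s in zip(weights, engine_scores[u])),
                  reverse=True)

def feedback_update(weights, scores, relevant, lr=0.1):
    """Nudge weights toward engines that scored a relevant result highly, and away otherwise."""
    sign = 1.0 if relevant else -1.0
    return [w + sign * lr * s for w, s in zip(weights, scores)]

# Example: the user marks "b.com" relevant, so engines that scored it highly gain weight.
weights = [0.5, 0.5]
urls = {"a.com": [0.9, 0.1], "b.com": [0.2, 0.8]}
weights = feedback_update(weights, urls["b.com"], relevant=True)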