An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections

@article{Shekhar2010AnAF,
  title={An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections},
  author={Shashi Shekhar and Rohit Agrawal and Karm Veer Arya},
  journal={2010 International Conference on Advances in Computer Engineering},
  year={2010},
  pages={29-33}
}
As the Web continues to grow, it has become difficult to find relevant information using traditional search engines. There are many index-based web search engines for searching information in various domains on the Web, but the documents (URLs) they retrieve for a searched topic are often of poor quality. Moreover, as the number of Web pages grows at a rapid pace, the issue of devising a personalized Web search is of great importance. This paper proposes a…
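
The abstract is truncated above, but the title indicates that the proposed crawler filters replicated web collections before returning results. As a rough, hypothetical illustration of that general idea (not the authors' actual architecture), a crawler can fingerprint normalized page content and treat any URL whose fingerprint has already been seen as a replica:

```python
import hashlib

class DedupFilter:
    """Filters replicated documents by fingerprinting normalized content.

    Illustrative sketch only; the paper's actual architecture may differ.
    """

    def __init__(self):
        self.seen_fingerprints = set()

    @staticmethod
    def fingerprint(html: str) -> str:
        # Normalize whitespace and case so trivially mirrored copies collide.
        normalized = " ".join(html.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_replica(self, html: str) -> bool:
        fp = self.fingerprint(html)
        if fp in self.seen_fingerprints:
            return True  # content already crawled under another URL
        self.seen_fingerprints.add(fp)
        return False
```

A production crawler would more likely use near-duplicate fingerprints (e.g., shingling or simhash) rather than exact hashes, since mirrored pages often differ only in boilerplate.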


Clustering Retrieved Web Documents to Speed Up Web Searches

TLDR
Experimental results show that QClus is effective and efficient in generating high-quality clusters of documents on specific topics with informative labels, which saves the user’s time and effort in searching for specific information of interest without having to browse through the documents one by one.

Semantic Based Image Retrieval using multi-agent model by searching and filtering replicated web images

TLDR
The proposed framework overcomes two major problems in retrieving user-centric images from the Web, the freshness problem and the redundancy problem, and can also be used as a personalized image search engine that effectively extracts text information on the Web to semantically describe the retrieved images.

Enhancing Web Search Using Query-Based Clusters and Labels

  • Rani Qumsiyeh, Yiu-Kai Ng
  • Computer Science
    2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)
  • 2013
TLDR
Experimental results show that QCL is effective and efficient in generating high-quality clusters of documents on specific topics with informative labels, which saves the user's time and effort in searching for specific information of interest without having to browse through the documents one by one.

Enhancing web search by using query-based clusters and multi-document summaries

TLDR
Experimental results show that the proposed query-based cluster and summarizer, called QSum, is effective and efficient in generating a high-quality summary for each cluster of documents on a specific topic.

Study of Web Crawler and its Different Types

TLDR
This paper briefly reviews the concepts of the web crawler, its architecture, and its various types; web crawlers are an essential method for collecting data on, and keeping up with, the rapidly growing Internet.

Extraction System Web Content Sports New Based On Web Crawler Multi Thread

TLDR
This research uses a multi-thread approach to produce web crawlers that crawl sports news faster by drawing on more than one news source address at a time.
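
The cited paper's implementation is not shown here, but the multi-thread pattern its summary describes, fetching more than one news source address at a time, can be sketched with Python's standard thread pool (the source URLs below are placeholders, not taken from the paper):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical news source addresses; placeholders, not from the paper.
SOURCES = [
    "https://example.com/sports/a",
    "https://example.com/sports/b",
    "https://example.com/sports/c",
]

def fetch(url: str) -> tuple[str, int]:
    """Download one page and return (url, size in bytes)."""
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

# One worker thread per source, so all addresses are fetched concurrently.
with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
    for url, size in pool.map(fetch, SOURCES):
        print(f"{url}: {size} bytes")
```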

Firefly Optimization Algorithm Based Web Scraping for Web Citation Extraction

TLDR
The primary purpose of this research is author information extraction: the process extracts citation information published by an author, including the journal name, publisher, year, and citations, using web citation analysis.

A Performance Discussion of Web Crawler Using Some Advanced Factors

TLDR
This research considers the error rate to enhance accuracy and reduce space and time during web crawling, using advanced parameters such as TTL, the frequency of web page visits, and user-reported spam to minimize errors by filtering URLs on the basis of these parameters.

Genetically optimizing query expansion for retrieving activities from the web

TLDR
An overview of the system's design, which is based on semantic query expansion, is given, along with a detailed explanation of the optimization of the system's parameters through the use of genetic algorithms.

Searching made easy: A multithreading based desktop search engine

TLDR
This paper proposes a faster version of the desktop search tool based on a multithreading approach: a number of threads equal to the number of drives in the desktop is created, and all the drives are searched simultaneously rather than sequentially.
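
A minimal sketch of the one-thread-per-drive idea described above, assuming filename-only matching and caller-supplied drive roots (both simplifications, not details from the paper):

```python
import os
import threading

def search_drive(root: str, keyword: str, results: list) -> None:
    """Scan one drive's file names for the keyword (content search omitted)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if keyword.lower() in name.lower():
                results.append(os.path.join(dirpath, name))

def parallel_search(drives: list[str], keyword: str) -> list[str]:
    # One thread per drive, so drives are searched simultaneously
    # rather than sequentially, as the cited paper proposes.
    results: list[str] = []
    threads = [
        threading.Thread(target=search_drive, args=(d, keyword, results))
        for d in drives
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Example (drive roots are platform-specific assumptions):
# hits = parallel_search(["C:\\", "D:\\"], "report")
```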

References


Mining the Web's Link Structure

TLDR
Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic, and hubs, which provide collections of links to authorities.
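
Clever's hub/authority analysis follows the well-known mutual-reinforcement iteration (Kleinberg's HITS): a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. A minimal sketch over a toy link graph (the graph itself is hypothetical):

```python
import math

# Toy directed link graph: page -> pages it links to (hypothetical data).
links = {
    "a": ["c", "d"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(50):  # iterate until the scores stabilize
    # A page's authority is the sum of the hub scores of pages linking to it.
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # A page's hub score is the sum of the authorities it links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # Normalize so the scores do not grow without bound.
    for scores in (auth, hub):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        for p in scores:
            scores[p] /= norm

print({p: round(auth[p], 3) for p in auth})  # best "authorities" on the topic
```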

Intelligent crawling on the World Wide Web with arbitrary predicates

TLDR
This paper proposes the novel concept of intelligent crawling, in which the crawler learns characteristics of the linkage structure of the World Wide Web while performing the crawl; the technique is termed intelligent crawling because of its adaptive nature in adjusting to the Web's linkage structure.
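
The paper's actual predicate learner is not reproduced in this summary; the sketch below only illustrates the adaptive flavor, a best-first crawler that re-weights candidate links as it observes which URL features led to relevant pages. The fetch, is_relevant, and features callables are hypothetical stand-ins:

```python
import heapq

def adaptive_crawl(seed_urls, fetch, is_relevant, features, max_pages=100):
    """Best-first crawl that re-weights link features as evidence accrues.

    fetch(url) -> (text, outlinks); is_relevant(text) -> bool;
    features(url) -> iterable of feature tokens. All three are
    caller-supplied stand-ins; the paper's learner is more elaborate.
    """
    seen = {}   # per-feature count of crawled pages carrying the feature
    hits = {}   # per-feature count of those pages that were relevant

    def score(url):
        # Estimated relevance, averaged over the URL's features
        # (with a mild prior of 0.5 for unseen features).
        fs = list(features(url))
        if not fs:
            return 0.5
        return sum(hits.get(f, 1) / seen.get(f, 2) for f in fs) / len(fs)

    frontier = [(-1.0, u) for u in seed_urls]  # max-priority via negation
    visited = set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch(url)
        relevant = is_relevant(text)
        for f in features(url):
            seen[f] = seen.get(f, 2) + 1
            hits[f] = hits.get(f, 1) + (1 if relevant else 0)
        for link in outlinks:
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return visited
```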

Searching for Hidden-Web Databases

TLDR
A new crawling strategy to automatically locate hidden-Web databases is proposed which aims to achieve a balance between the two conflicting requirements of this problem: the need to perform a broad search while at the same time avoiding the need to crawl a large number of irrelevant pages.

An adaptive crawler for locating hidden-Web entry points

TLDR
A new framework is proposed whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning.

Crawling the Hidden Web

TLDR
A generic operational model of a hidden-Web crawler is introduced, and its realization in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford, is described.

Semantic Web Content Analysis: A Study in Proximity-Based Collaborative Clustering

TLDR
This study proposes an approach for binding the "semantic" facet with the usual textual one that together constitute a typical web page, or specifically a semantic web document, and offers a new alternative for organizing web documents that emphasizes a direct separation between the syntactic and semantic facets of web information.

Cooperative crawling

  • M. Buzzi
  • Computer Science
  • 2003
TLDR
This work proposes a scheme to permit a crawler to acquire information about the global state of a Website before the crawling process takes place, which requires Web server cooperation in order to collect and publish information on its content.

Design and implementation of a high-performance distributed Web crawler

TLDR
This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.

Web mining: information and pattern discovery on the World Wide Web

TLDR
This paper defines Web mining and presents an overview of the various research issues, techniques, and development efforts, and briefly describes WEBMINER, a system for Web usage mining, and concludes the paper by listing research issues.

A scalable comparison-shopping agent for the World-Wide Web

TLDR
ShopBot, a fully implemented, domain-independent comparison-shopping agent that relies on a combination of heuristic search, pattern matching, and inductive learning techniques, enables users to both find superior prices and substantially reduce Web shopping time.