Ashutosh Dixit

The World Wide Web is an interlinked collection of billions of documents. Ironically, the very size of this collection has become an obstacle to information retrieval: the user has to sift through scores of pages to come upon the information he or she desires. Web crawlers are the heart of search engines. Mercator is a scalable, extensible web crawler, …
The World Wide Web is a huge source of hyperlinked information contained in hypertext documents. Search engines use web crawlers to collect these documents from the web for storage and indexing. However, many of these documents contain dynamic information that changes on a daily, weekly, monthly, or yearly basis, and hence we need to refresh …
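As a rough illustration of this idea (not the scheme proposed in the paper), a crawler could tag each page with an assumed change-frequency class and re-fetch it only once the corresponding interval has elapsed. The class labels and intervals below are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical refresh intervals per change-frequency class; the paper's
# actual refresh policy is not reproduced here.
REFRESH_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

def is_due_for_refresh(last_crawled, change_class, now=None):
    """Return True if a page should be re-fetched, given when it was last
    crawled and how often its content is assumed to change."""
    now = now or datetime.utcnow()
    interval = REFRESH_INTERVALS.get(change_class, timedelta(weeks=1))
    return now - last_crawled >= interval

# Example: a page crawled ten days ago that changes monthly is not yet due.
print(is_due_for_refresh(datetime.utcnow() - timedelta(days=10), "monthly"))  # False
```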
A study reports that about 40% of current Internet traffic and bandwidth consumption is due to web crawlers that retrieve pages for indexing by the different search engines. As the size of the web continues to grow, searching it for useful information has become increasingly difficult. Centralized crawling techniques are unable to cope with …
The WWW's expansion, coupled with the high change frequency of web pages, poses a challenge for maintaining and fetching up-to-date information. Traditional crawling methods can no longer keep up with this updating and growing web. An alternative distributed crawling scheme that uses migrating crawlers tries to maximize network utilization by minimizing the …
The WWW is a decentralized, distributed, and heterogeneous information resource. With the increased availability of information through the WWW, it is very difficult to read all documents to retrieve the desired results; therefore, there is a need for summarization methods that can present the contents of a given document in a precise manner. Keywords of a document …
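A minimal sketch of keyword-driven extractive summarization, assuming the document's most frequent non-stopword terms serve as its keywords and sentences are ranked by how many keywords they contain; the paper's own method may differ.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on"}

def keyword_summary(text, num_keywords=5, num_sentences=2):
    """Naive extractive summary: rank sentences by how many of the
    document's most frequent (non-stopword) terms they contain."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    keywords = {w for w, _ in Counter(words).most_common(num_keywords)}
    scored = sorted(
        sentences,
        key=lambda s: sum(w in keywords for w in re.findall(r"[a-z]+", s.lower())),
        reverse=True,
    )
    return " ".join(scored[:num_sentences])

print(keyword_summary(
    "Web crawlers collect pages for search engines. Crawlers revisit pages "
    "that change often. Summarization condenses a document into its key sentences."
))
```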
A question answering system can be seen as the next step in information retrieval, allowing users to pose questions in natural language and receive compact answers. For a question answering system to be successful, research has shown that correct classification of a question with respect to its expected answer type is a prerequisite. We propose a novel …
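To make the notion of expected-answer-type classification concrete, here is a toy rule-based classifier; the trigger patterns and type labels are illustrative assumptions, not the novel scheme the paper proposes.

```python
# Minimal rule-based question classifier mapping question cues to answer types.
ANSWER_TYPE_RULES = [
    (("who", "whom"), "PERSON"),
    (("where",), "LOCATION"),
    (("when",), "DATE/TIME"),
    (("how many", "how much"), "NUMBER"),
    (("why",), "REASON"),
    (("what", "which"), "ENTITY/DEFINITION"),
]

def classify_question(question):
    """Map a natural-language question to an expected answer type."""
    q = question.lower().strip()
    for triggers, answer_type in ANSWER_TYPE_RULES:
        if any(q.startswith(t) or f" {t} " in q for t in triggers):
            return answer_type
    return "OTHER"

print(classify_question("Who invented the World Wide Web?"))        # PERSON
print(classify_question("How many pages does the crawler fetch?"))  # NUMBER
```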
Due to the lack of efficient refresh techniques, current crawlers add unnecessary traffic to the already overloaded Internet. The frequency of visits to sites can be optimized by calculating refresh time dynamically. This helps improve the effectiveness of the crawling system by efficiently managing the revisiting frequency of a website, and appropriate …
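One simple way to compute a refresh time dynamically is to adapt the revisit interval from observed changes: shrink it when the page changed since the last crawl and grow it when the page was unchanged. The factors and bounds below are assumptions for illustration, not the paper's formula.

```python
# Sketch of an adaptive revisit interval driven by observed page changes.
MIN_INTERVAL_HOURS = 1
MAX_INTERVAL_HOURS = 24 * 30  # cap at roughly a month

def next_refresh_interval(current_hours, page_changed):
    """Shorten the revisit interval when the page changed since the last
    crawl, and lengthen it when the page was found unchanged."""
    if page_changed:
        new_hours = current_hours / 2      # visit more often
    else:
        new_hours = current_hours * 1.5    # back off, save bandwidth
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, new_hours))

interval = 24.0
for changed in [True, True, False, False, False]:
    interval = next_refresh_interval(interval, changed)
    print(f"next visit in {interval:.1f} hours")
```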
The Deep Web is content hidden behind HTML forms. Since it represents a large portion of the structured, unstructured, and dynamic data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a crawler for accessing the Deep Web using ontologies. Performance evaluation of the proposed work showed that this …
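A very simplified sketch of one way ontology knowledge can drive form submission: domain concepts supply candidate values for matching form fields, and the crawler issues one query URL per value. The ontology contents, field names, and URL are hypothetical and do not come from the paper.

```python
import urllib.parse

# Toy "ontology": domain concepts mapped to candidate values for form fields.
BOOK_ONTOLOGY = {
    "author": ["knuth", "tanenbaum"],
    "subject": ["databases", "information retrieval"],
}

def generate_form_queries(action_url, field_names):
    """For every form field that matches an ontology concept, emit one
    GET query URL per candidate value (a very simplified surfacing strategy)."""
    for field in field_names:
        for value in BOOK_ONTOLOGY.get(field.lower(), []):
            yield f"{action_url}?{urllib.parse.urlencode({field: value})}"

for url in generate_form_queries("http://example.com/search", ["author", "subject"]):
    print(url)
```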
A general crawler downloads web pages of any kind, thus forming a source of information for the search engine. A blog crawler is similar to a general crawler except that it restricts its crawl boundary to the blog space, downloading only blog pages and ignoring the rest of the web. Since blogs are an emerging phenomenon and serve as a very useful …
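A minimal sketch of how a crawl boundary could be restricted to the blog space, assuming a URL-based heuristic; a real blog crawler (including the one described above) may instead inspect page content or feeds, so the host and path hints here are only an approximation.

```python
from urllib.parse import urlparse

# Heuristic cues for "blog space"; these hints are illustrative assumptions.
BLOG_HOST_HINTS = ("blogspot.", "wordpress.", "medium.", "tumblr.")
BLOG_PATH_HINTS = ("/blog/", "/blogs/", "/post/")

def in_blog_space(url):
    """Keep a URL in the crawl frontier only if it looks like a blog page."""
    parsed = urlparse(url)
    host, path = parsed.netloc.lower(), parsed.path.lower()
    return any(h in host for h in BLOG_HOST_HINTS) or any(p in path for p in BLOG_PATH_HINTS)

frontier = [u for u in [
    "https://example.wordpress.com/2020/01/post",
    "https://example.com/products/item42",
    "https://example.com/blog/crawling-notes",
] if in_blog_space(u)]
print(frontier)  # the product page is discarded
```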