Corpus ID: 15039623

Building Domain-Specific Search Engines with Machine Learning Techniques

@inproceedings{McCallum1999BuildingDS,
  title={Building Domain-Specific Search Engines with Machine Learning Techniques},
  author={Andrew McCallum and Kamal Nigam and Jason D. M. Rennie and Kristie Seymore},
  year={1999}
}
Domain-specific search engines are growing in popularity because they offer increased accuracy and extra functionality not possible with general, Web-wide search engines. For example, www.campsearch.com allows complex queries by age group, size, location and cost over summer camps. Unfortunately, these domain-specific search engines are difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of… 
A Machine Learning Approach to Building Domain-Specific Search Engines
TLDR
The use of machine learning techniques is proposed to greatly automate the creation and maintenance of domain-specific search engines, and new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments is described.
Study on domain-specific search engine and its automated generation
As web information expands, vertical search engines play a more and more important role in the search industry. With "being the best in specialized utilization fields" in mind, we develop…
Machine learning, data mining, and the World Wide Web : design of special-purpose search engines
We present DEADLINER, a special-purpose search engine that indexes conference and workshop announcements, and which extracts a range of academic information from the Web. SVMs provide an efficient…
Binary Feature Selection and Integration in Specialized Search Engines
We present a methodology for rapid implementation of specialized search engines. To catalog data, these search engines interpret and classify the content of web material to identify different…
Intelligent Web topics search using early detection and data analysis
  • Ching-Cheng Lee, Yixin Yang
  • Computer Science
    Proceedings 27th Annual International Computer Software and Applications Conference. COMPSAC 2003
  • 2003
TLDR
This project does early detection for "candidate topics" while extracting words from the HTML text and performs data analysis on the appearance information such as appearance times and places for candidate topics to reduce candidate topics' crawling times and computing cost.
Extracting Domain-Specific Concepts to Enhance Search Accuracy
  • Jun Gong, Lu Liu
  • Computer Science
    2009 First International Conference on Information Science and Engineering
  • 2009
TLDR
This paper presents a concept-based retrieval method to deal with domain-specific searches and shows that this approach is effective in improving domain-specific search accuracy.
Compiling document collections from the Internet
TLDR
Results suggest that the rough estimates of precision and recall calculated in this study offer great promise, and that the crawler significantly reduced the time required for manual analysis of document content.
Towards next generation vertical search engines
Abstract of the dissertation "Towards Next Generation Vertical Search Engines" by Li Zheng, Florida International University, Miami, Florida, 2014 (co-major professors: Tao Li and Shu-Ching Chen)…
Semantic domain specific search engine
TLDR
A new hypertext resource discovery system called topic specific crawler is described, to selectively seek out pages that are relevant to a predefined set of topics, rather than collecting and indexing all accessible web documents to be able to answer all possible ad-hoc queries.
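The topic-specific crawler described above fetches pages in order of estimated relevance rather than breadth-first. A minimal best-first frontier can be sketched as follows; the scoring function, seed URLs, and link-fetching callback here are illustrative assumptions, not the cited system's actual design.

```python
import heapq

def focused_crawl(seeds, score, fetch_links, budget=100):
    """Best-first crawl: always expand the highest-scoring known URL.

    `score` maps a URL to a relevance estimate; `fetch_links` returns the
    outgoing links of a page. heapq is a min-heap, so scores are negated.
    """
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited
```

With a toy graph and a score of 1 for on-topic URLs, the crawler visits `topic-a` before the off-topic sibling, which is the behavior that lets a focused crawler skip most of the Web.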
Metadata based Web mining for relevance
  • J. Yi, Neel Sundaresan
  • Computer Science
    Proceedings 2000 International Database Engineering and Applications Symposium (Cat. No.PR00789)
  • 2000
TLDR
A topic-specific search engine is built that requires significantly less human labor but performs almost as well as topic-specific search engines whose content is maintained by humans.

References

Showing 1–10 of 29 references
A Machine Learning Architecture for Optimizing Web Search Engines
TLDR
A wide range of heuristics for adjusting document rankings based on the special HTML structure of Web documents are described, including a novel one inspired by reinforcement learning techniques for propagating rewards through a graph which can be used to improve a search engine's rankings.
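The reinforcement-learning heuristic mentioned above propagates reward backward through the link graph, so pages that lead toward valuable pages score highly even if they carry no reward themselves. A minimal sketch of that idea, with a made-up graph, rewards, and discount factor (the paper's actual formulation may differ):

```python
def propagate_rewards(links, immediate_reward, gamma=0.5, iterations=10):
    """Estimate each page's value as its immediate reward plus the
    discounted maximum value among the pages it links to (value iteration)."""
    value = dict(immediate_reward)
    for _ in range(iterations):
        value = {
            page: immediate_reward.get(page, 0.0)
            + gamma * max((value.get(t, 0.0) for t in targets), default=0.0)
            for page, targets in links.items()
        }
    return value

# Toy web graph: page "a" links to "b", which links to reward page "c".
links = {"a": ["b"], "b": ["c"], "c": []}
values = propagate_rewards(links, {"c": 1.0})
```

Here `b` inherits half of `c`'s reward and `a` a quarter, so a crawler or ranker preferring high-value pages is steered toward `c` from two hops away.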
Learning to Extract Symbolic Knowledge from the World Wide Web
TLDR
The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web, and several machine learning algorithms for this task are described.
Learning Page-Independent Heuristics for Extracting Data from Web Pages
TLDR
A method for learning general, page-independent heuristics for extracting data from HTML documents that can substantially improve the performance of methods for learning page-specific wrappers.
CiteSeer: an autonomous Web agent for automatic retrieval and identification of interesting publications
TLDR
A Web-based information agent that assists the user in the process of performing a scientific literature search and can find papers which are similar to a given paper using word information and by analyzing common citations made by the papers.
A Web-based information system that reasons with structured collections of text
TLDR
Experimental evidence is given showing that many information sources can be easily modeled with WHIRL, and that inferences in the logic are both accurate and efficient.
Improving Text Classification by Shrinkage in a Hierarchy of Classes
TLDR
This paper shows that the accuracy of a naive Bayes text classifier can be improved by taking advantage of a hierarchy of classes, and adopts an established statistical technique called shrinkage that smooths parameter estimates of a data-sparse child with its parent in order to obtain more robust parameter estimates.
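The core of shrinkage is a linear interpolation of word-probability estimates along the path from a class to the root. A minimal sketch with fixed mixture weights (the paper learns these weights with EM; the weights and vocabulary size below are illustrative assumptions):

```python
def shrink(child_probs, parent_probs, vocab_size, lambdas=(0.6, 0.3, 0.1)):
    """Combine child, parent, and uniform word-probability estimates.

    A data-sparse child class borrows statistical strength from its
    parent and from a uniform distribution over the vocabulary.
    """
    l_child, l_parent, l_uniform = lambdas
    uniform = 1.0 / vocab_size
    return {
        w: l_child * child_probs.get(w, 0.0)
        + l_parent * parent_probs.get(w, 0.0)
        + l_uniform * uniform
        for w in set(child_probs) | set(parent_probs)
    }
```

A word the child has barely seen still gets a sensible probability from its parent's estimate, which is what makes the smoothed classifier more robust on sparse classes.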
Information Extraction Using Hidden Markov Models
TLDR
This thesis shows how to design and tune a hidden Markov model to extract factual information from a corpus of machine-readable English prose and presents a HMM that classifies and parses natural language assertions about genes being located at particular positions on chromosomes.
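Extraction with an HMM amounts to labeling each token with the most likely hidden state, typically via Viterbi decoding. A toy two-state sketch (all probabilities are made up for illustration; the thesis's actual model structure and tuning are more involved):

```python
import math

def viterbi(tokens, states, start, trans, emit):
    """Return the most likely state sequence for the token list.

    `best[s]` holds the (log-probability, path) of the best path ending
    in state s; unseen emissions get a tiny floor probability.
    """
    best = {s: (math.log(start[s]) + math.log(emit[s].get(tokens[0], 1e-6)), [s])
            for s in states}
    for tok in tokens[1:]:
        best = {
            s: max(
                ((lp + math.log(trans[prev][s]) + math.log(emit[s].get(tok, 1e-6)),
                  path + [s])
                 for prev, (lp, path) in best.items()),
                key=lambda x: x[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda x: x[0])[1]
```

With a "background" state that emits common words and a "gene" state that emits gene names, decoding picks out which tokens are the factual content to extract.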
WebWatcher : A Tour Guide for the World Wide Web
We explore the notion of a tour guide software agent for assisting users browsing the World Wide Web. A Web tour guide agent provides assistance similar to that provided by a human tour guide in a…
Statistical language learning
TLDR
Eugene Charniak points out that as a method of attacking NLP problems, the statistical approach has several advantages and is grounded in real text and therefore promises to produce usable results, and it offers an obvious way to approach learning.
ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery
TLDR
This analysis highlights an interesting feature of the Web environment that bodes well for ARACHNID's search methods and discusses the role played in both by user relevance feedback and unsupervised learning by individual agents.