Hidden Markov models (HMMs) are a powerful probabilistic tool for modeling sequential data, and have been applied with success to many text-related tasks, such as part-of-speech tagging, text segmentation and information extraction. In these cases, the observations are usually modeled as multinomial distributions over a discrete vocabulary, and the HMM …
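For readers unfamiliar with the model, here is a minimal sketch of the forward algorithm for an HMM whose emissions are multinomial distributions over a small discrete vocabulary. This is an illustration of the general technique, not the paper's implementation, and every parameter value below is invented.

```python
import numpy as np

# All parameters are illustrative: 2 hidden states, 3-word vocabulary.
pi = np.array([0.6, 0.4])              # initial state distribution
A = np.array([[0.7, 0.3],              # state transition matrix
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],         # multinomial emission probabilities,
              [0.1, 0.3, 0.6]])        # one row per hidden state

def forward_likelihood(obs):
    """Return P(obs) by summing over all hidden state paths (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]
    return alpha.sum()

print(forward_likelihood([0, 2, 1]))   # likelihood of the word sequence 0, 2, 1
```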
The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and …
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures ("wrappers") for highly structured text such as Web pages produced by CGI scripts. For suitably …
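To make the second sub-problem concrete, here is a toy sketch in the spirit of LR ("left-right") wrapper induction: learn literal delimiter strings around a target field from labeled example pages, then reuse them on unseen pages. The pages and field values are invented for illustration.

```python
def induce_lr_wrapper(examples):
    """examples: list of (page_html, target_value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        i = page.index(value)
        lefts.append(page[:i])
        rights.append(page[i + len(value):])
    # Longest common suffix of the left contexts ...
    left = lefts[0]
    for s in lefts[1:]:
        while not s.endswith(left):
            left = left[1:]
    # ... and longest common prefix of the right contexts.
    right = rights[0]
    for s in rights[1:]:
        while not s.startswith(right):
            right = right[:-1]
    return left, right

def apply_wrapper(page, wrapper):
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

examples = [("<b>Price:</b> <i>12 USD</i><br>", "12 USD"),
            ("<b>Price:</b> <i>7 USD</i><br>", "7 USD")]
wrapper = induce_lr_wrapper(examples)
print(apply_wrapper("<b>Price:</b> <i>99 USD</i><br>", wrapper))  # -> 99 USD
```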
We describe an information seeking assistant for the world wide web. This agent, called WebWatcher, interactively helps users locate desired information by employing learned knowledge about which hyperlinks are likely to lead to the target information. Our primary focus to date has been on two issues: (1) organizing WebWatcher to provide interactive advice …
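As a speculative stand-in for WebWatcher's learned knowledge, the sketch below scores the hyperlinks on a page by keyword overlap between their anchor text and the user's stated information goal, then suggests the best match. The scoring rule and all data are our own illustration.

```python
def suggest_link(goal, links):
    """links: list of (anchor_text, url) pairs; return the best-matching one."""
    goal_words = set(goal.lower().split())
    def score(link):
        anchor, _ = link
        return len(goal_words & set(anchor.lower().split()))
    return max(links, key=score)

links = [("Machine Learning Group", "/ml"),
         ("Campus Map", "/map"),
         ("Course Schedule", "/courses")]
print(suggest_link("machine learning information extraction", links))
```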
Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a …
Recent work on the problem of detecting synonymy through corpus analysis has used the Test of English as a Foreign Language (TOEFL) as a benchmark. However, this test involves as few as 80 questions, which raises questions about the statistical significance of reported results. We overcome this limitation by generating a TOEFL-like test using WordNet, …
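One plausible reading of that generation step is sketched below using NLTK's WordNet interface (requires `nltk.download("wordnet")`): the correct answer is another lemma from one of the target word's synsets, i.e. a WordNet synonym. The distractor pool here is hardcoded for brevity; a real generator would sample candidates and filter them against WordNet relations.

```python
import random
from nltk.corpus import wordnet as wn

def make_question(word, n_distractors=3, seed=0):
    rng = random.Random(seed)
    # Pick a sense of the word that has at least one other lemma (a synonym).
    synset = next(s for s in wn.synsets(word) if len(s.lemma_names()) > 1)
    answer = next(l for l in synset.lemma_names() if l != word)
    # Hardcoded illustrative distractors; real ones would be WordNet-filtered.
    distractor_pool = ["river", "justice", "yellow", "whisper", "ladder"]
    choices = rng.sample(distractor_pool, n_distractors) + [answer]
    rng.shuffle(choices)
    return word, choices, answer

print(make_question("car"))  # correct choice is a WordNet synonym, e.g. "auto"
```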
We consider the problem of learning to perform information extraction in domains where linguistic processing is problematic, such as Usenet posts, email, and finger plan files. In place of syntactic and semantic information, other sources of information can be used, such as term frequency, typography, formatting, and mark-up. We describe four learning …
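A minimal sketch of what such non-linguistic features might look like, computed per line of an informal document; the feature names and example text are our own, not taken from the paper.

```python
from collections import Counter

def line_features(line, term_counts, line_no, total_lines):
    """Typography, formatting, and term-frequency cues for one line of text."""
    tokens = line.split()
    return {
        "frac_capitalized": sum(t[0].isupper() for t in tokens) / max(len(tokens), 1),
        "frac_digits": sum(c.isdigit() for c in line) / max(len(line), 1),
        "indented": line.startswith((" ", "\t")),
        "relative_position": line_no / total_lines,
        "max_term_freq": max((term_counts[t.lower()] for t in tokens), default=0),
    }

doc = "Subject: apartment for rent\n2 bedrooms, 850/month\n  call 555-0100"
lines = doc.split("\n")
counts = Counter(w.lower() for w in doc.split())
for i, line in enumerate(lines):
    print(line_features(line, counts, i, len(lines)))
```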
This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and characterizing the relation between question and answer with a statistical model. For the purpose of learning …
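The general setup can be illustrated as follows: featurize question/candidate-answer pairs, fit a statistical model on answered questions, and rank new candidates by model score. The features, training data, and choice of logistic regression are ours, not the paper's.

```python
from sklearn.linear_model import LogisticRegression

def pair_features(question, candidate):
    """Simple overlap features relating a question to a candidate response."""
    q, c = set(question.lower().split()), set(candidate.lower().split())
    return [len(q & c),                       # word overlap
            len(q & c) / max(len(q | c), 1),  # Jaccard similarity
            len(candidate.split())]           # candidate length

train = [("who wrote hamlet", "shakespeare wrote hamlet", 1),
         ("who wrote hamlet", "the weather is sunny", 0),
         ("capital of france", "paris is the capital of france", 1),
         ("capital of france", "shakespeare wrote hamlet", 0)]

X = [pair_features(q, a) for q, a, _ in train]
y = [label for _, _, label in train]
model = LogisticRegression().fit(X, y)

# Rank candidate responses for a new question by model probability.
candidates = ["paris is the capital of france", "the weather is sunny"]
scores = model.predict_proba([pair_features("what is the capital of france", c)
                              for c in candidates])[:, 1]
print(sorted(zip(scores, candidates), reverse=True))
```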
Information extraction (IE) is the problem of filling out pre-defined structured summaries from text documents. We are interested in performing IE in non-traditional domains, where much of the text is often ungrammatical, such as electronic bulletin board posts and Web pages. We suggest that the best approach is one that takes into account many different kinds …
We explore the notion of a tour guide software agent for assisting users browsing the World Wide Web. A Web tour guide agent provides assistance similar to that provided by a human tour guide in a museum: it guides the user along an appropriate path through the collection, based on its knowledge of the user's interests, of the location and relevance of …