We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and …
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled …
We consider semi-supervised learning of information extraction methods, especially for extracting instances of noun categories (e.g., 'athlete,' 'team') and relations (e.g., 'playsForTeam(athlete, team)'). Semi-supervised approaches using a small number of labeled examples together with many unlabeled examples are often unreliable, as they frequently produce …
Whereas people learn many different types of knowledge from diverse experiences over many years, most current machine learning systems acquire just a single function or data model from just a single data set. We propose a never-ending learning paradigm for machine learning, to better reflect the more ambitious and encompassing type of learning performed by …
We describe recent extensions to the Ephyra question answering (QA) system and their evaluation in the TREC 2007 QA track. Existing syntactic answer extraction approaches for factoid and list questions have been complemented with a high-accuracy semantic approach that generates a semantic representation of the question and extracts answer candidates from …
A key question regarding the future of the semantic web is "how will we acquire structured information to populate the semantic web on a vast scale?" One approach is to enter this information manually. A second approach is to take advantage of pre-existing databases, and to develop common ontologies, publishing standards, and reward systems to make this …
This paper describes a featurized functional dependency corpus automatically derived from the Penn Treebank. Each word in the corpus is associated with over three dozen features describing the functional syntactic structure of a sentence as well as some shallow morphology. The corpus was created for use in probabilistic surface generation, but could also be …
We report research toward a never-ending language learning system, focusing on a first implementation which learns to classify occurrences of noun phrases according to lexical categories such as "city" and "university." Our experiments suggest that the accuracy of classifiers produced by semi-supervised learning can be improved by coupling the learning …
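The coupling idea in this abstract can be illustrated with a minimal sketch: a candidate noun phrase is promoted to a category only when its evidence clearly beats every mutually exclusive competitor. The seed phrases, scores, and threshold rule below are illustrative assumptions, not the paper's actual data or algorithm.

```python
# Minimal sketch of coupled semi-supervised promotion: a noun phrase joins a
# category only if its score clears a threshold AND beats every competing,
# mutually exclusive category by a margin. All scores are hypothetical.
from collections import defaultdict

# Hypothetical evidence scores: noun phrase -> {category: score}
scores = {
    "Pittsburgh": {"city": 0.9, "university": 0.1},
    "CMU":        {"city": 0.2, "university": 0.8},
    "Boston":     {"city": 0.7, "university": 0.6},  # ambiguous evidence
}

def promote(scores, threshold=0.5, margin=0.3):
    """Promote each phrase to its best-scoring category only when that score
    is high and exceeds the runner-up by `margin` (the coupling constraint)."""
    promoted = defaultdict(list)
    for phrase, cat_scores in scores.items():
        ranked = sorted(cat_scores.items(), key=lambda kv: -kv[1])
        (best_cat, best), (_, second) = ranked[0], ranked[1]
        if best >= threshold and best - second >= margin:
            promoted[best_cat].append(phrase)
    return dict(promoted)

print(promote(scores))
# "Boston" is held back: its city/university scores are too close, so the
# mutual-exclusion coupling filters it out of both categories.
```

The point of the sketch is the filtering behavior: an uncoupled learner would happily promote "Boston" to both categories and compound its errors on later iterations.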
In this paper, we describe the JAVELIN Cross Language Question Answering system, which includes modules for question analysis, keyword translation, document retrieval, information extraction and answer generation. In the NTCIR6 CLQA2 evaluation, our system achieved 19% and 13% accuracy in the English-to-Chinese and English-to-Japanese subtasks, …
Many event monitoring systems rely on counting known keywords in streaming text data to detect sudden spikes in frequency. But the dynamic and conversational nature of Twitter makes it hard to select known keywords for monitoring. Here we consider a method of automatically finding noun phrases (NPs) as keywords for event monitoring in Twitter. Finding NPs …
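The keyword-spike detection that this abstract builds on can be sketched in a few lines: track a sliding window of recent counts and flag any time step whose count far exceeds the recent mean. The stream, window size, and threshold rule below are illustrative assumptions, not the paper's method.

```python
# Toy sketch of frequency-spike detection over streaming keyword counts,
# the baseline behavior the event-monitoring abstract describes.
from collections import deque
from statistics import mean, stdev

def detect_spikes(counts, window=5, k=3.0):
    """Flag time steps whose count exceeds the recent mean by k standard
    deviations, using a sliding window of history."""
    history = deque(maxlen=window)
    spikes = []
    for t, c in enumerate(counts):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if c > mu + k * max(sigma, 1e-9):  # guard against zero variance
                spikes.append(t)
        history.append(c)
    return spikes

# Hourly counts for a hypothetical keyword; hour 7 is a burst.
counts = [3, 4, 2, 3, 4, 3, 2, 40, 5, 3]
print(detect_spikes(counts))  # -> [7]
```

This only works once the keyword is known in advance, which is exactly the limitation the abstract points at: on Twitter the interesting keywords emerge with the event, motivating automatic noun-phrase discovery instead of a fixed watch list.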