Justin Betteridge

We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and …
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled …
Whereas people learn many different types of knowledge from diverse experiences over many years, most current machine learning systems acquire just a single function or data model from just a single data set. We propose a never-ending learning paradigm for machine learning, to better reflect the more ambitious and encompassing type of learning performed by …
We consider semi-supervised learning of information extraction methods, especially for extracting instances of noun categories (e.g., ‘athlete,’ ‘team’) and relations (e.g., ‘playsForTeam(athlete,team)’). Semi-supervised approaches using a small number of labeled examples together with many unlabeled examples are often unreliable as they frequently produce …
We report research toward a never-ending language learning system, focusing on a first implementation which learns to classify occurrences of noun phrases according to lexical categories such as “city” and “university.” Our experiments suggest that the accuracy of classifiers produced by semi-supervised learning can be improved by coupling the learning of …
We describe recent extensions to the Ephyra question answering (QA) system and their evaluation in the TREC 2007 QA track. Existing syntactic answer extraction approaches for factoid and list questions have been complemented with a high-accuracy semantic approach that generates a semantic representation of the question and extracts answer candidates from …
Many event monitoring systems rely on counting known keywords in streaming text data to detect sudden spikes in frequency. But the dynamic and conversational nature of Twitter makes it hard to select known keywords for monitoring. Here we consider a method of automatically finding noun phrases (NPs) as keywords for event monitoring in Twitter. Finding NPs …
A key question regarding the future of the semantic web is “how will we acquire structured information to populate the semantic web on a vast scale?” One approach is to enter this information manually. A second approach is to take advantage of pre-existing databases, and to develop common ontologies, publishing standards, and reward systems to make this …
Distant supervision (DS) is a method for training sentence-level information extraction models using only an unlabeled corpus and a knowledge base (KB). Fundamental to many DS approaches is the assumption that KB facts are expressed at least once (EALO) in the text corpus. Often, however, KB facts are actually expressed in the corpus many times, in which …
In this paper, we describe the JAVELIN Cross Language Question Answering system, which includes modules for question analysis, keyword translation, document retrieval, information extraction and answer generation. In the NTCIR6 CLQA2 evaluation, our system achieved 19% and 13% accuracy in the English-to-Chinese and English-to-Japanese subtasks, …