Learn More
Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninfor-mative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary(More)
How do we scale information extraction to the massive size and unprecedented heterogeneity of the Web corpus? Beginning in 2003, our KnowItAll project has sought to extract high-quality knowledge from the Web. In 2007, we introduced the Open Information Extraction (Open IE) paradigm which eschews hand-labeled training examples, and avoids domain-specific(More)
We study question answering as a machine learning problem, and induce a function that maps open-domain questions to queries over a database of web extractions. Given a large, community-authored, question-paraphrase corpus, we demonstrate that it is possible to learn a semantic lexicon and linear ranking function without manually annotating questions. Our(More)
The old Asian legend about the blind men and the elephant comes to mind when looking at how different authors of scientific papers describe a piece of related prior work. It turns out that different citations to the same paper often focus on different aspects of that paper and that neither provides a full description of its full set of contributions. In(More)
We consider the problem of open-domain question answering (Open QA) over massive knowledge bases (KBs). Existing approaches use either manually curated KBs like Freebase or KBs automatically extracted from unstructured text. In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. A key technical challenge is designing(More)
This paper investigates the " named-entity disam-biguation " task on the Web—identifying the refer-ent of a string, found on an arbitrary Web page. The GROUNDER system, introduced in this paper, addresses two challenges not considered by previous work: how to utilize a priori information (e.g., Bill Clinton is more prominent on the Web than Clin-ton County)(More)
During implementation of a new Do-Not-Resuscitate (DNR) policy in New York State, decisions by 233 nursing home patients of their surrogates were evaluated. Eighteen patients with capacity (mean age +/- SD = 76.4 +/- 12.1 years) chose DNR; 30 patients with capacity (mean age +/- SD = 76.2 +/- 10.7 years) chose to be resuscitated (CODE); 54 patients without(More)
We introduce a technique for identifying the most salient participants in a discussion. Our method, MavenRank is based on lexical cen-trality: a random walk is performed on a graph in which each node is a participant in the discussion and an edge links two participants who use similar rhetoric. As a test, we used MavenRank to identify the most influential(More)
This study presents the Chinese Open Relation Extraction (CORE) system that is able to extract entity-relation triples from Chinese free texts based on a series of NLP techniques, i.e., word segmentation, POS tagging, syntactic parsing, and extraction rules. We employ the proposed CORE techniques to extract more than 13 million entity-relations for an open(More)