Learn More
Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over sets of noun phrases, resulting in a first-order proba-bilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our(More)
We present SampleRank, an alternative to con-trastive divergence (CD) for estimating parameters in complex graphical models. SampleR-ank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. Sam-pleRank is not only faster than CD, but(More)
Recently, many advanced machine learning approaches have been proposed for coreference resolution; however, all of the discriminatively-trained models reason over mentions rather than entities. That is, they do not explicitly contain variables indicating the " canonical " values for each attribute of an entity (e.g., name, venue, title, etc.). This(More)
The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems(More)
Modern optical character recognition software relies on human interaction to correct misrecognized characters. Even though the software often reliably identifies low-confidence output, the simple language and vocabulary models employed are insufficient to automatically correct mistakes. This paper demonstrates that topic models, which automatically detect(More)
Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on "(More)
Incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power, scalability, or treatment of relational algebra operators. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a(More)
Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks including difficulties scaling to large numbers of mentions and limited representational power. As the severity of these drawbacks continue to progress with the growing demand for more data, the need to(More)
In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a probabilistic database (PDB) based declarative IE system that supports a leading statistical IE model, and an(More)
Named-entity recognition systems extract entities such as people, organizations, and locations from unstructured text. Rather than extract these mentions in isolation, this paper presents a record extraction system that assembles mentions into records (i.e. database tuples). We construct a probabilistic model of the compatibility between field values, then(More)