Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over sets of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our…
We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleRank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but…
Recently, many advanced machine learning approaches have been proposed for coreference resolution; however, all of the discriminatively-trained models reason over mentions rather than entities. That is, they do not explicitly contain variables indicating the "canonical" values for each attribute of an entity (e.g., name, venue, title, etc.). This…
The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems…
Conditional random fields and other graphical models have achieved state-of-the-art results in a variety of tasks such as coreference, relation extraction, data integration, and parsing. Increasingly, practitioners are using models with more complex structure (higher tree-width, larger fan-out, more features, and more data), rendering even approximate…
Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks, including difficulty scaling to large numbers of mentions and limited representational power. As these drawbacks grow more severe with the increasing demand for more data, the need to…
Modern optical character recognition software relies on human interaction to correct misrecognized characters. Even though the software often reliably identifies low-confidence output, the simple language and vocabulary models employed are insufficient to automatically correct mistakes. This paper demonstrates that topic models, which automatically detect…
Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on "…
Incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power, scalability, or treatment of relational algebra operators. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a…