This paper presents AnCora, a multilingual corpus annotated at different linguistic levels consisting of 500,000 words in Catalan (AnCora-Ca) and in Spanish (AnCora-Es). At present AnCora is the largest multilayer annotated corpus of these languages freely available from http://clic.ub.edu/ancora. The two corpora consist mainly of newspaper texts annotated…
1 Motivation: The three motivations behind the Rand index [4], a general clustering evaluation metric, can be rephrased in coreference terms: (i) every mention is unequivocally assigned to a specific entity; (ii) entities are defined just as much by those mentions which they do not contain as by those mentions which they do contain; and (iii) all mentions…
We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role…
A discourse typically involves numerous entities, but few are mentioned more than once. Distinguishing discourse entities that die out after just one mention (singletons) from those that lead longer lives (coreferent) would benefit NLP applications such as coreference resolution, protagonist identification, topic modeling, and discourse coherence. We…
The definitions of two coreference scoring metrics, B³ and CEAF, are underspecified with respect to predicted, as opposed to key (or gold) mentions. Several variations have been proposed that manipulate either, or both, the key and predicted mentions in order to get a one-to-one mapping. On the other hand, the metric BLANC was, until recently, limited to…
This paper explores the effect that different corpus configurations have on the performance of a coreference resolution system, as measured by MUC, B³, and CEAF. By varying separately three parameters (language, annotation scheme, and preprocessing information) and applying the same coreference resolution system, the strong bonds between system and corpus…
Unbiased language is a requirement for reference sources like encyclopedias and scientific texts. Bias is, nonetheless, ubiquitous, making it crucial to understand its nature and linguistic realization and hence detect bias automatically. To this end we analyze real instances of human edits designed to remove bias from Wikipedia articles. The analysis…
The task of coreference resolution requires people or systems to decide when two referring expressions refer to the 'same' entity or event. In real text, this is often a difficult decision because identity is never adequately defined, leading to contradictory treatment of cases in previous work. This paper introduces the concept of 'near-identity', a middle…
Coreference resolution systems rely heavily on string overlap (e.g., Google Inc. and Google), performing badly on mentions with very different words (opaque mentions) like Google and the search giant. Yet prior attempts to resolve opaque pairs using ontologies or distributional semantics hurt precision more than they improved recall. We present a new…