Learn More
This article presents a measure of semantic similarity in an is-a taxonomy based on the notion of shared information content. Experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge-counting approach. The article presents algorithms that take advantage of taxonomic(More)
Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor intensive and time-consuming process, and it is difficult to(More)
Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and(More)
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of(More)
The absence of training data is a real problem for corpus-based approaches to sense disambiguation, one that is unlikely to be solved soon. Selectional preference is traditionally connected with sense ambiguity; this paper explores how a statistical model of selectional preference, requiring neither manual annotation of selection restrictions nor supervised(More)
A new, information-theoretic model of selectional constraints is proposed. The strategy adopted here is a minimalist one: how far can one get making as few assumptions as possible? In keeping with that strategy, the proposed model consists of only two components: first, a fairly generic taxonomic representation of concepts, and, second, a probabilistic(More)