Word Association Norms, Mutual Information and Lexicography
TLDR
The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.
A Program for Aligning Sentences in Bilingual Corpora
TLDR
This paper describes a method and a program for aligning sentences based on a simple statistical model of character lengths, which uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences.
A stochastic parts program and noun phrase parser for unrestricted text
TLDR
A program that tags each word in an input sentence with the most likely part of speech has been written and performance is encouraging; a 400-word sample is presented and is judged to be 99.5% correct.
Very sparse random projections
TLDR
This paper proposes sparse random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space that multiplies A by a random matrix R ∈ ℝ^(D×k), reducing the D dimensions down to just k to speed up the computation.
One Sense Per Discourse
TLDR
An experiment confirmed the hypothesis that if a polysemous word such as "sentence" appears two or more times in a well-written discourse, it is extremely likely that all occurrences share the same sense; the tendency to share sense within the same discourse was found to be extremely strong.
Using Statistics in Lexical Analysis
TLDR
The computational tools available for studying machine-readable corpora are at present still rather primitive; these corpora and a basic concordancing tool are used to fill in detailed syntactic descriptions, prompting a move towards more thorough descriptions of lexical syntax.
A method for disambiguating word senses in a large corpus
TLDR
The proposed method was designed to disambiguate senses that are usually associated with different topics using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval.
Query suggestion using hitting time
TLDR
A novel query suggestion algorithm based on ranking queries with the hitting time on a large-scale bipartite graph; it can successfully boost long-tail queries, accommodate personalized query suggestion, and find related authors in research.
Inverse Document Frequency (IDF): A Measure of Deviations from Poisson
TLDR
Inverse document frequency (IDF), a quantity borrowed from Information Retrieval, is used to distinguish words like "somewhat" and "boycott"; "boycott" is a better keyword because its IDF is farther from what would be expected by chance (Poisson).
Poisson mixtures
TLDR
The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ, intended to capture dependencies on hidden variables such as genre, author, topic, etc.