This paper describes a new statistical parser which is based on probabilities of dependencies between head-words in the parse tree. Standard bigram probability estimation techniques are extended to calculate probabilities of dependencies between pairs of words. Tests using Wall Street Journal data show that the method performs at least as well as SPATTER (Magerman 95; Jelinek et al. 94), which has the best published results for a statistical parser on this task. The simplicity of the approach means the model trains on 40,000 sentences in under 15 minutes. With a beam search strategy, parsing speed can be improved to over 200 sentences a minute with negligible loss in accuracy.

1 Introduction

Lexical information has been shown to be crucial for many parsing decisions, such as prepositional-phrase attachment (for example (Hindle and Rooth 93)). However, early approaches to probabilistic parsing (Pereira and Schabes 92; Magerman and Marcus 91; Briscoe and Carroll 93) conditioned probabilities on non-terminal labels and part-of-speech tags alone. The SPATTER parser (Magerman 95; Jelinek et al. 94) does use lexical information, and recovers labeled constituents in Wall Street Journal text with above 84% accuracy, as far as we know the best published results on this task.

This paper describes a new parser which is much simpler than SPATTER, yet performs at least as well when trained and tested on the same Wall Street Journal data. The method uses lexical information directly by modeling head-modifier¹ relations between pairs of words. In this way it is similar to Link grammars (Lafferty et al. 92), and dependency grammars in general.

*This research was supported by ARPA Grant N6600194-C6043.
¹By 'modifier' we mean the linguistic notion of either an argument or adjunct.

2 The Statistical Model

The aim of a parser is to take a tagged sentence as input (for example Figure 1(a)) and produce a phrase-structure tree as output (Figure 1(b)).
A statistical approach to this problem consists of two components. First, the statistical model assigns a probability to every candidate parse tree for a sentence. Formally, given a sentence $S$ and a tree $T$, the model estimates the conditional probability $P(T \mid S)$. The most likely parse under the model is then:

$$T_{best} = \arg\max_{T} P(T \mid S) \qquad (1)$$

Second, the parser is a method for finding $T_{best}$. This section describes the statistical model, while section 3 describes the parser.

The key to the statistical model is that any tree such as Figure 1(b) can be represented as a set of baseNPs² and a set of dependencies as in Figure 1(c). We call the set of baseNPs $B$, and the set of dependencies $D$; Figure 1(d) shows $B$ and $D$ for this example. For the purposes of our model, $T = (B, D)$, and:

$$P(T \mid S) = P(B, D \mid S) = P(B \mid S) \times P(D \mid S, B) \qquad (2)$$

²A baseNP or 'minimal' NP is a non-recursive NP, i.e. none of its child constituents are NPs. The term was first used in (Ramshaw and Marcus 95).

$S$ is the sentence with words tagged for part of speech. That is, $S = \langle (w_1, t_1), (w_2, t_2), \ldots, (w_n, t_n) \rangle$. For POS tagging we use a maximum-entropy tagger described in (Ratnaparkhi 96). The tagger performs at around 97% accuracy on Wall Street Journal text, and is trained on the first 40,000 sentences of the Penn Treebank (Marcus et al. 93).

Given $S$ and $B$, the reduced sentence $\bar{S}$ is defined as the subsequence of $S$ which is formed by removing punctuation and reducing all baseNPs to their head-word alone.
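The construction of the reduced sentence can be illustrated with a minimal Python sketch. It is not the paper's implementation. The encoding of each baseNP as a `(start, end, head)` index triple, the `PUNCT_TAGS` set, the function name `reduce_sentence`, and the example sentence with its tags and head choices are all assumptions made for illustration. The sketch only shows the two operations named in the definition: dropping punctuation and keeping just the head-word of each baseNP.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]        # (word, POS tag)
BaseNP = Tuple[int, int, int]  # (start, end_exclusive, head_index): hypothetical encoding

# Assumed set of punctuation tags; adjust to the tag set actually in use.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}


def reduce_sentence(sentence: List[Token], base_nps: List[BaseNP]) -> List[Token]:
    """Remove punctuation and reduce every baseNP to its head-word."""
    # Map each token position inside a baseNP to the position of that NP's head.
    head_of: Dict[int, int] = {}
    for start, end, head in base_nps:
        for i in range(start, end):
            head_of[i] = head

    reduced: List[Token] = []
    for i, (word, tag) in enumerate(sentence):
        if i in head_of:
            # Inside a baseNP: keep the token only if it is the head-word.
            if head_of[i] == i:
                reduced.append((word, tag))
        elif tag not in PUNCT_TAGS:
            # Outside any baseNP: keep everything except punctuation.
            reduced.append((word, tag))
    return reduced


# Illustrative tagged sentence (assumed example, not taken from the text above).
sentence = [
    ("John", "NNP"), ("Smith", "NNP"), (",", ","), ("the", "DT"),
    ("president", "NN"), ("of", "IN"), ("IBM", "NNP"), (",", ","),
    ("announced", "VBD"), ("his", "PRP$"), ("resignation", "NN"),
    ("yesterday", "NN"),
]
# Assumed baseNP spans with head positions:
# [John Smith], [the president], [IBM], [his resignation], [yesterday]
base_nps = [(0, 2, 1), (3, 5, 4), (6, 7, 6), (9, 11, 10), (11, 12, 11)]

reduced = reduce_sentence(sentence, base_nps)
print(" ".join(word for word, _ in reduced))
# prints: Smith president of IBM announced resignation yesterday
```

In this sketch the head position of each baseNP is supplied by the caller; how heads are chosen is outside what the definition above specifies.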