Correspondence-guided Synchronous Parsing of Parallel Corpora

Abstract

We present an efficient dynamic programming algorithm for synchronous parsing of sentence pairs from a parallel corpus with a given word alignment. Unless a large proportion of words lack a correspondence in the other language, the worst-case complexity is significantly lower than that of standard synchronous parsing. The theoretical complexity results are corroborated by a quantitative experimental evaluation.

Our longer-term goal is to induce monolingual grammars from a parallel corpus, exploiting implicit information about syntactic structure obtained from correspondence patterns.[1] Here we provide an important prerequisite for parallel corpus-based grammar induction: an efficient algorithm for synchronous parsing, given a particular word alignment (e.g., the most likely option from a statistical alignment).

Synchronous grammars. We assume a straightforward extension of context-free grammars (compare the transduction grammars of [Lewis II and Stearns, 1968]): (1) the terminal and non-terminal categories are pairs of symbols (or NIL); (2) the sequence of daughters can differ for the two languages; we use a compact rule notation with a numerical ranking for the linear precedence in each language. The general form of a rule is $N_0/M_0 \rightarrow N_1{:}i_1/M_1{:}j_1 \;\ldots\; N_k{:}i_k/M_k{:}j_k$, where $N_l$, $M_l$ are NIL or a (non-)terminal symbol for language $L_1$ and $L_2$, respectively, and $i_l$, $j_l$ are natural numbers for the rank in the sequence for $L_1$ and $L_2$ (for NIL categories a special rank 0 is assumed). See fig. 1 for a sample analysis of the German/English sentence pair Wir müssen deshalb die Agrarpolitik prüfen/So we must look at the agricultural policy. We assume a normal form in which the right-hand side is ordered by the rank in $L_1$.[2] The formalism goes along with the continuity assumption that every complete constituent is continuous in both languages.[3]

Synchronous parsing.
Our dynamic programming algorithm can be viewed as a variant of Earley parsing and generation, which in turn can be described by inference rules. For instance, the central completion step in Earley parsing can be described by the rule[4]

$$\frac{\langle X \rightarrow \alpha \bullet Y \beta,\ [i, j]\rangle \qquad \langle Y \rightarrow \gamma\, \bullet,\ [j, k]\rangle}{\langle X \rightarrow \alpha\, Y \bullet \beta,\ [i, k]\rangle} \tag{1}$$

The input in synchronous parsing is not a one-dimensional string, but a pair of sentences, i.e., a two-dimensional array of possible word pairs (or a multidimensional array if we are looking at a multilingual corpus). The natural way of generalizing context-free parsing to synchronous grammars is thus to use string indices in both dimensions. So we get inference rules like the following (there is another one in which the $i_2/j_2$ and $j_2/k_2$ indices are swapped between the two items above the line):

$$\frac{\langle X_1/X_2 \rightarrow \alpha \bullet Y_1{:}r_1/Y_2{:}r_2\, \beta,\ [i_1, j_1, j_2, k_2]\rangle \qquad \langle Y_1/Y_2 \rightarrow \gamma\, \bullet,\ [j_1, k_1, i_2, j_2]\rangle}{\langle X_1/X_2 \rightarrow \alpha\, Y_1{:}r_1/Y_2{:}r_2 \bullet \beta,\ [i_1, k_1, i_2, k_2]\rangle} \tag{2}$$

Since each inference rule contains six free variables over string positions ($i_1, j_1, k_1, i_2, j_2, k_2$), we get a parsing complexity of order $O(n^6)$ for unlexicalized grammars (where $n$ is the number of words in the longer of the two strings from $L_1$ and $L_2$) [Wu, 1997; Melamed, 2003].

[1] Cf. the new PTOLEMAIOS project at Saarland University (http://www.coli.uni-saarland.de/~jonask/PTOLEMAIOS/).

[2] However, categories that are NIL in $L_1$ come last. If there are several, they are viewed as unordered with respect to each other.

[3] As [Melamed, 2003] discusses, such an assumption is empirically problematic with binary grammars. However, if flat analyses are assumed for clauses and NPs, the typical problematic cases are resolved.

Correspondence-guided parsing. As an alternative to standard "rectangular indexing" we propose an asymmetric approach: one of the languages ($L_1$) provides the "primary index", namely the string span in $L_1$, as in monolingual parsing.
As a secondary index, $L_2$ contributes a chart-generation-style bit vector of the words covered, which is mainly used to guide parsing, i.e., certain options are eliminated. A complete sample index for müssen/must in fig. 1 would be $\langle [1, 2], [00100000] \rangle$. Completion can be formulated as inference rule (3).[5]

$$\frac{\langle X_1/X_2 \rightarrow \alpha \bullet Y_1{:}r_1/Y_2{:}r_2\, \beta,\ \langle [i, j], \mathbf{v} \rangle\rangle \qquad \langle Y_1/Y_2 \rightarrow \gamma\, \bullet,\ \langle [j, k], \mathbf{w} \rangle\rangle}{\langle X_1/X_2 \rightarrow \alpha\, Y_1{:}r_1/Y_2{:}r_2 \bullet \beta,\ \langle [i, k], \mathbf{u} \rangle\rangle} \tag{3}$$

where (i) $j \neq k$; (ii) $\mathrm{OR}(\mathbf{v}, \mathbf{w}) = \mathbf{u}$; (iii) $\mathbf{w}$ is continuous (i.e., it contains at most one subsequence of 1's).

Condition (iii) excludes discontinuity in passive chart items, i.e., complete constituents; active items (i.e., partial constituents) may well contain discontinuities.

[4] A chart item is specified through a position ($\bullet$) in a production and a string span ($[l_1, l_2]$). $\langle X \rightarrow \alpha \bullet Y \beta,\ [i, j]\rangle$ is an active item recording that between position $i$ and $j$, an incomplete $X$ phrase has been found, which covers $\alpha$, but still misses $Y \beta$. Items with a final $\bullet$ are called passive.

[5] We use the bold-faced variables $\mathbf{v}, \mathbf{w}, \mathbf{u}$ for bit vectors; OR performs bitwise disjunction on the vectors.
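The completion step of rule (3) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Item` class, the function names, and the sample rule `S/S → NP/NP V/V VP/VP` are assumptions made for illustration, the rank annotations on daughters are omitted, and only the three side conditions stated above are checked. The L2 bit vector is stored as a Python integer, with the sample index for müssen/must taken from the text and the index for Wir/we inferred from the sentence pair.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Item:
    """Chart item: a dotted rule plus the asymmetric index <[i, j], v>."""
    lhs: str                # category pair, e.g. "S/S"
    rhs: Tuple[str, ...]    # daughter category pairs (ranks omitted in this sketch)
    dot: int                # number of daughters already found
    span: Tuple[int, int]   # primary index: string span in L1
    cover: int              # secondary index: bit vector of covered L2 words

    @property
    def is_passive(self) -> bool:
        return self.dot == len(self.rhs)


def is_continuous(bits: int) -> bool:
    """True if the set bits form at most one contiguous run of 1's."""
    if bits == 0:
        return True
    lowest = bits & -bits       # isolate the lowest set bit
    carried = bits + lowest     # the carry clears the lowest run of 1's
    return carried & bits == 0  # any surviving bit belongs to a second run


def complete(active: Item, passive: Item) -> Optional[Item]:
    """Apply completion as in rule (3); return None if it does not apply."""
    if active.is_passive or not passive.is_passive:
        return None
    if active.rhs[active.dot] != passive.lhs:
        return None
    i, j = active.span
    j2, k = passive.span
    if j != j2:                             # L1 spans must be adjacent
        return None
    if j == k:                              # condition (i): j != k
        return None
    if not is_continuous(passive.cover):    # condition (iii): w is continuous
        return None
    u = active.cover | passive.cover        # condition (ii): u = OR(v, w)
    return Item(active.lhs, active.rhs, active.dot + 1, (i, k), u)


# Hypothetical items: Wir/we already found, müssen/must being attached.
active = Item("S/S", ("NP/NP", "V/V", "VP/VP"), 1, (0, 1), int("01000000", 2))
passive = Item("V/V", ("müssen/must",), 1, (1, 2), int("00100000", 2))
result = complete(active, passive)
print(result.span, format(result.cover, "08b"))  # (0, 2) 01100000
```

Storing the bit vector as an integer makes the OR in condition (ii) a single operation, and the continuity test in condition (iii) needs no loop over the L2 words.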

Cite this paper

@inproceedings{Kuhn2005CorrespondenceguidedSP, title={Correspondence-guided Synchronous Parsing of Parallel Corpora}, author={Jonas Kuhn}, booktitle={IJCAI}, year={2005} }