
Corpus Variation and Parser Performance

  • D. Gildea
  • Published in EMNLP 2001
  • Computer Science
Most work in statistical parsing has focused on a single corpus: the Wall Street Journal portion of the Penn Treebank. While this has allowed for quantitative comparison of parsing techniques, it has left open the question of how other types of text might affect parser performance, and how portable parsing models are across corpora. We examine these questions by comparing results for the Brown and WSJ corpora, and also consider which parts of the parser's probability model are particularly tuned…
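The quantitative comparisons discussed here (and in the papers below) are typically PARSEVAL-style labeled bracketing scores. A minimal sketch of that metric over (label, start, end) spans — the function name and toy trees are illustrative, not from the paper:

```python
from collections import Counter

def parseval_scores(gold_brackets, pred_brackets):
    """Labeled bracketing precision, recall, and F1 over (label, start, end) spans.

    Duplicate brackets are matched at most once, via multiset intersection,
    as in standard PARSEVAL scoring.
    """
    gold = Counter(gold_brackets)
    pred = Counter(pred_brackets)
    matched = sum((gold & pred).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: gold tree has 4 labeled brackets, the parser proposes 4,
# and 3 of them match exactly (the last span gets the wrong label).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
p, r, f = parseval_scores(gold, pred)  # each 3/4 = 0.75 here
```

In practice these scores are computed by tools such as EVALB over whole test sections, which is what the percentage figures quoted throughout this page refer to.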


Evaluating a Statistical CCG Parser on Wikipedia
It is found that the C&C parser's standard model is 4.3% less accurate on Wikipedia text, but that a simple self-training exercise reduces the gap to 3.8%.
Parsing Any Domain English text to CoNLL dependencies
A benchmarking study of different state-of-the-art parsers for English, both constituency and dependency, is reported, along with rerankers for the Berkeley and Stanford parsers, to study the usefulness of reranking for handling texts from different domains.
Unbounded Dependency Recovery for Parser Evaluation
A new parser evaluation corpus containing around 700 sentences annotated with unbounded dependencies, from seven different grammatical constructions is introduced, to evaluate how well state-of-the-art parsing technology is able to recover such dependencies.
Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output
This work classifies errors within a set of linguistically meaningful types using tree transformations that repair groups of errors together, and uses this analysis to answer a range of questions about parser behaviour, including what linguistic constructions are difficult for state-of-the-art parsers, what types of errors are being resolved by rerankers, and what types are introduced when parsing out-of-domain text.
Annotation Schemes and their Influence on Parsing Results
This paper uses two similar German treebanks, TüBa-D/Z and NeGra, to investigate the role that different annotation decisions play in parsing; the two treebanks are approximated by gradually removing or inserting the corresponding annotation components, and a standard PCFG parser is tested on all treebank versions.
Automatic Prediction of Parser Accuracy
This paper proposes a technique that automatically takes into account certain characteristics of the domains of interest and accurately predicts parser performance on data from these new domains, yielding a cheap and effective recipe for measuring the performance of a statistical parser on any given domain.
Reranking and Self-Training for Parser Adaptation
The reranking parser described in Charniak and Johnson (2005) improves performance on Brown to 85.2%, and use of the self-training techniques described in McClosky et al. (2006) raises this to 87.8% (an error reduction of 28%), again without any use of labeled Brown data.
Training a Parser for Machine Translation Reordering
The method is applied to train parsers that excel when used as part of a reordering component in a statistical machine translation system and uses a corpus of weakly-labeled reference reorderings to guide parser training.
Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques
The model combines full and partial parsing techniques to reach full grammar coverage on unseen data, and on a gold standard of manually annotated f-structures for a subset of the WSJ treebank, reaches 79% F-score.
Dutch Dependency Parser Performance Across Domains
This work evaluates the performance variation of two kinds of dependency parsing systems for Dutch (grammar-driven versus data-driven) across several domains and extends the statistical measures used by Zhang and Wang (2009a) for English and proposes a new simple measure to quantify domain sensitivity.


Statistical Decision-Tree Models for Parsing
SPATTER is described, a statistical parser based on decision-tree learning techniques which constructs a complete parse for every sentence and achieves accuracy rates far better than any published result.
A Statistical Parser for Czech
This paper considers statistical parsing of Czech, which differs radically from English in at least two respects: (1) it is a highly inflected language, and (2) it has relatively free word order.
A New Statistical Parser Based on Bigram Lexical Dependencies
A new statistical parser which is based on probabilities of dependencies between head-words in the parse tree, which trains on 40,000 sentences in under 15 minutes and can be improved to over 200 sentences a minute with negligible loss in accuracy.
Statistical Parsing with a Context-Free Grammar and Word Statistics
A parsing system is described based upon a language model for English that, in turn, assigns probabilities to possible parses for a sentence, and that outperforms previous schemes.
A Maximum-Entropy-Inspired Parser
A new parser for parsing down to Penn tree-bank style parse trees is presented that achieves 90.1% average precision/recall for sentences of length 40 or less, and 89.5% for sentences of length 100 or less, when trained and tested on the previously established sections of the Wall Street Journal treebank.
Three Generative, Lexicalised Models for Statistical Parsing
A new statistical parsing model is proposed, which is a generative model of lexicalised context-free grammar and extended to include a probabilistic treatment of both subcategorisation and wh-movement.
Head-Driven Statistical Models for Natural Language Parsing
  • M. Collins
  • Computer Science
    Computational Linguistics
  • 2003
Three statistical models for natural language parsing are described, leading to approaches in which a parse tree is represented as the sequence of decisions corresponding to a head-centered, top-down derivation of the tree.
Supervised Grammar Induction using Training Data with Limited Constituent Information
It is shown that the most informative linguistic constituents are the higher nodes in the parse trees, typically denoting complex noun phrases and sentential clauses, and an adaptation strategy is proposed, which produces grammars that parse almost as well as grammars induced from fully labeled corpora.
Using Register-Diversified Corpora for General Language Studies
  • D. Biber
  • Linguistics
    Computational Linguistics
  • 1993
The present study summarizes corpus-based research on linguistic characteristics from several different structural levels, in English as well as other languages, showing that register variation is…
How Verb Subcategorization Frequencies Are Affected By Corpus Choice
It is concluded that verb sense and discourse type play an important role in the frequencies observed in different experimental and corpus based sources of verb subcategorization frequencies.