Publications
Detecting Errors in Part-of-Speech Annotation
A new method is proposed for detecting errors in "gold-standard" part-of-speech annotation, based on n-grams that occur in the corpus with multiple taggings and complemented by closed-class analysis and finite-state tagging guide patterns.
Detecting Inconsistencies in Treebanks
Treebanks generally result from a (semi-)manual markup process, and so contain errors from automatic processes, human post-editing, or human annotation.
On Detecting Errors in Dependency Treebanks
This work explores how a technique proposed for detecting errors in constituency-based syntactic annotation can be adapted to systematically detect errors in dependency annotation, and discusses results for dependency treebanks for Swedish, Czech, and German.
Error detection and correction in annotated corpora
A method is presented for detecting and correcting errors in corpora with linguistic annotation using the so-called variation n-gram method. The method can automatically correct errors with 85% accuracy, and the work demonstrates that variation is a powerful notion for detecting errors.
Prune Diseased Branches to Get Healthy Trees! How to Find Erroneous Local Trees in a Treebank and Why It Matters
Annotated corpora are essential for training and testing algorithms in natural language processing (NLP), but even so-called gold-standard corpora contain a significant number of annotation errors.
Defining Syntax for Learner Language Annotation
It is shown that subcategorization seems better able to underspecify annotation in situations where no single correct solution can be found, which represents a significant step in elucidating syntax for non-canonical language.
A balancing act: how can intelligent computer-generated feedback be provided in learner-to-learner interactions?
This work designs a parser-based system that provides feedback on particle usage for first-year L2 Korean learners while they chat in CMC. The system guides the content of the activity through picture-based information-gap tasks and a game record, and controls the range of allowable learner input with a word bank.
Towards Domain Adaptation for Parsing Web Data
It is found that approximating the in-domain data has a positive impact on parsing, and different ways to select out-of-domain parsed data to add to training are examined.
Generating Learner-Like Morphological Errors in Russian
A linguistically informed method for generating learner-like morphological errors is described. It guides stem and suffix combinations from a segmented lexicon to match particular error categories and draws on grammatical information from the original context.