• Corpus ID: 221041002

Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Stefan Heid, Marcel Wever, Eyke Hüllermeier
Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While the problem of POS tagging can essentially be considered as solved for modern languages, historical corpora turn out to be much more difficult, especially due… 

Annotation Uncertainty in the Context of Grammatical Change
This article can be seen as an attempt to reconcile the perspectives of the main scientific disciplines involved in corpus projects, linguistics and computer science, to develop a unified view and to highlight the potential synergies between these disciplines.
Conformal prediction for text infilling and part-of-speech prediction
This paper proposes inductive conformal prediction algorithms for the tasks of text infilling and part-of-speech prediction for natural language data and demonstrates that the ICP algorithms are able to produce valid set-valued predictions that are small enough to be applicable in real-world applications.
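Set-valued predictions of the kind mentioned above can be illustrated with split (inductive) conformal prediction. The following is a minimal sketch, not the implementation from any paper listed here: the function names `conformal_threshold` and `predict_set` are hypothetical, and it assumes a tagger that already outputs a probability for every tag of a token.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Compute a nonconformity threshold from a held-out calibration set.

    cal_probs  -- list of dicts mapping tag -> predicted probability
    cal_labels -- list of true tags, aligned with cal_probs
    alpha      -- tolerated error rate (e.g. 0.25 targets 75% coverage)
    """
    # Nonconformity score: 1 minus the probability given to the true tag.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every tag is included.
        return float("inf")
    return scores[k - 1]


def predict_set(probs, threshold):
    """Return every tag whose nonconformity score is within the threshold."""
    return {tag for tag, p in probs.items() if 1.0 - p <= threshold}


if __name__ == "__main__":
    # Toy calibration data (assumed values, for illustration only).
    cal_probs = [
        {"NOUN": 0.875, "VERB": 0.125},
        {"NOUN": 0.5, "VERB": 0.5},
        {"NOUN": 0.25, "VERB": 0.75},
    ]
    cal_labels = ["NOUN", "NOUN", "VERB"]
    t = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
    print(predict_set({"NOUN": 0.5, "VERB": 0.5}, t))
```

An ambiguous token yields a set with several candidate tags, while a confidently tagged token yields a single tag. Lowering `alpha` makes the sets larger but more likely to contain the correct tag.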


Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
A novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data, reaches 88.7% tagging accuracy, a new high in PTB-compatible tweet part-of-speech tagging.
An automatic part-of-speech tagger for Middle Low German
The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG), which poses a challenge to standard POS taggers, which usually rely on normalized spelling.
Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.
Supporting the Cognitive Process in Annotation Tasks
A new annotation tool with pattern learning support providing the annotators with suggestions inferred from previously studied MLG texts is developed, guided by the cognitive annotation process that steps back from the common text analysis pipeline.
Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger
This paper presents results for a maximum-entropy-based part-of-speech tagger that achieves superior performance principally by enriching the information sources used for tagging, in particular through more extensive treatment of capitalization for unknown words and features for disambiguating the tense forms of verbs.
A Maximum Entropy Model for Part-Of-Speech Tagging
A statistical model is presented that trains on a corpus annotated with part-of-speech tags and assigns them to previously unseen text with state-of-the-art accuracy; the paper also discusses the corpus consistency problems discovered while implementing the model's features.
Natural Language Processing (Almost) from Scratch
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity
Probabilistic part-of-speech tagging using decision trees
In this paper, a new probabilistic tagging method is presented which avoids problems that Markov Model based taggers face, when they have to estimate transition probabilities from sparse data. In
Improvements in Part-of-Speech Tagging with an Application to German
This paper presents a meta-modelling system that automates the labor-intensive, and therefore time-consuming and expensive, process of manually tagging parts of speech in a variety of languages.
TnT - A Statistical Part-of-Speech Tagger
Contrary to claims found elsewhere in the literature, it is argued that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework.