UdS-(retrain|distributional|surface): Improving POS Tagging for OOV Words in German CMC and Web Data

@inproceedings{Prange2016UdSretraindistributionalsurfaceIP,
  title={UdS-(retrain|distributional|surface): Improving POS Tagging for OOV Words in German CMC and Web Data},
  author={Jakob Prange and Andrea Horbach and Stefan Thater},
  booktitle={WAC@ACL},
  year={2016}
}
In this paper we present our three system submissions for the POS tagging subtask of the EmpiriST Shared Task: our baseline system UdS-retrain extends a standard training dataset with in-domain training data; UdS-distributional and UdS-surface add two different ways of handling OOV words on top of the baseline system, using either distributional information or a combination of surface similarity and language model information. We reach the best performance with the distributional model.
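The distributional idea described above can be illustrated with a minimal sketch (not the authors' code; all words, tags, and counts are invented examples): for an OOV word, a tag distribution is estimated by pooling the observed tag counts of its most similar in-vocabulary words, which can then replace the missing emission probability in an HMM tagger.

```python
from collections import Counter

def oov_tag_distribution(similar_words, tag_counts, k=3):
    """Estimate a tag distribution for an OOV word.

    similar_words: in-vocabulary words ranked by distributional similarity.
    tag_counts: dict mapping word -> Counter of tags observed in training.
    """
    pooled = Counter()
    for w in similar_words[:k]:
        pooled.update(tag_counts.get(w, Counter()))
    total = sum(pooled.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in pooled.items()}

# Toy training counts (STTS-style tags), purely for illustration:
tag_counts = {
    "haus": Counter({"NN": 10}),
    "auto": Counter({"NN": 8}),
    "geht": Counter({"VVFIN": 5}),
}
dist = oov_tag_distribution(["haus", "auto"], tag_counts)
# dist puts all probability mass on NN for this toy input
```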


Fine-Grained POS Tagging of German Social Media and Web Texts
TLDR
This paper takes a simple Hidden Markov Model based tagger as a starting point, and extends it with a distributional approach to estimating lexical (emission) probabilities of out-of-vocabulary words, which occur frequently in social media and web texts and are a major reason for the low performance of off-the-shelf taggers on these types of text.
A harmonised testsuite for POS tagging of German social media data
TLDR
A testsuite for POS tagging German web data that provides the original raw text as well as the gold tokenisations and is annotated for parts-of-speech, and shows how different experimental setups influence the accuracy of the taggers.
EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora
TLDR
The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: a genuine CMC data set with samples from several CMC genres, and a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web.
Proceedings of the 10th Web as Corpus Workshop, WAC@ACL 2016, Berlin, August 12, 2016
TLDR
Preliminary results from an ongoing experiment wherein two large unstructured text corpora are classified by topic domain (or subject area) are described, indicating that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts
TLDR
SoMeWeTa is described, a part-of-speech tagger based on the averaged structured perceptron that is capable of domain adaptation and that can use various external resources that substantially improves on the state of the art for both the web and the social media data sets.
Language Technologies for the Challenges of the Digital Age
TLDR
The study shows that the method can be applied successfully to spoken language, compares different ways of dealing with structures that are specific to spoken language corpora, analyses some remaining problems, and discusses ways of optimising precision or recall for the method.

References

Unsupervised Induction of Part-of-Speech Information for OOV Words in German Internet Forum Posts
We show that the accuracy of part-of-speech (POS) tagging of German Internet forum posts can be improved substantially by exploiting distributional similarity information about out-of-vocabulary (OOV) words.
Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication
TLDR
It is found that extending a standard training set with small amounts of manually annotated data for Internet texts leads to a substantial improvement of tagger performance, which can be further improved by using a previously proposed method to automatically acquire training data.
Fast Domain Adaptation for Part of Speech Tagging for Dialogues
TLDR
This work investigates a fast method for domain adaptation, which provides additional in-domain training data from an unannotated data set by applying POS taggers with different biases to the unannotated data set and then choosing the set of sentences on which the taggers agree.
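The agreement-based selection described above can be sketched as follows (a hypothetical illustration, not the paper's implementation; the stub taggers stand in for real POS taggers with different biases):

```python
def select_agreed(sentences, tagger_a, tagger_b):
    """Keep only sentences on which two taggers produce identical tag
    sequences; the agreed tags become silver training data."""
    selected = []
    for sent in sentences:
        tags_a = tagger_a(sent)
        tags_b = tagger_b(sent)
        if tags_a == tags_b:
            selected.append(list(zip(sent, tags_a)))
    return selected

# Stub taggers for illustration only (capitalised word -> NN, else X):
tagger_a = lambda sent: ["NN" if w[0].isupper() else "X" for w in sent]
tagger_b = lambda sent: ["NN" if w[0].isupper() else "X" for w in sent]
data = select_agreed([["Haus", "steht"]], tagger_a, tagger_b)
```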
Internet Corpora: A Challenge for Linguistic Processing
TLDR
A range of easy-to-implement methods of adapting existing part-of-speech taggers to improve their performance on Internet texts are explored and it is shown that these methods can improve tagger performance substantially.
Automatically Constructing a Normalisation Dictionary for Microblogs
TLDR
This paper proposes a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution and shows that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset.
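Normalisation via simple string substitution, as described above, amounts to a dictionary lookup per token. A minimal sketch (the dictionary entries are invented examples, not taken from the paper's resource):

```python
# Toy normalisation dictionary mapping lexical variants to known words:
norm_dict = {"u": "you", "2morrow": "tomorrow", "gr8": "great"}

def normalise(tokens, dictionary):
    """Replace each token by its dictionary entry if one exists,
    otherwise keep the token unchanged."""
    return [dictionary.get(t.lower(), t) for t in tokens]

out = normalise(["See", "u", "2morrow"], norm_dict)
# out == ["See", "you", "tomorrow"]
```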
HunPos: an open source trigram tagger
TLDR
HunPos is presented, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.
TIGER: Linguistic Interpretation of a German Corpus
TLDR
The TIGER Treebank, a corpus of currently 40,000 syntactically annotated German newspaper sentences, is described, along with the query language designed to facilitate simple formulation of complex queries and a graphical user interface for query input.
Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results
TLDR
This paper attempts to adapt a state-of-the-art English POS tagger, which is trained on the Wall-Street-Journal corpus, to noisy text, and demonstrates the working of the proposed models on a Short Message Service (SMS) dataset which achieve a significant improvement over the baseline accuracy.
Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge
SRILM - an extensible language modeling toolkit
TLDR
The functionality of the SRILM toolkit is summarized and its design and implementation is discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.