Corpus ID: 11249173

POS Tagging for Historical Texts with Sparse Training Data

@inproceedings{Bollmann2013POSTF,
  title={POS Tagging for Historical Texts with Sparse Training Data},
  author={Marcel Bollmann},
  booktitle={LAW@ACL},
  year={2013}
}
This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to preprocess the texts before applying a POS tagger trained on modern German corpora. Using only 250 manually normalized tokens as training data, the tagging accuracy on a manuscript from the 15th century can be raised from 28.65% to 74.89%.
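The pipeline the abstract describes, normalize historical spellings first and only then apply a tagger trained on modern data, can be sketched as follows. The rewrite rules and the toy tagger lexicon are hypothetical illustrations, not the paper's actual resources:

```python
# Minimal sketch of a normalize-then-tag pipeline for historical German.
# REWRITE_RULES and MODERN_TAGGER_LEXICON are made-up stand-ins for a
# trained normalizer and a tagger trained on modern German corpora.

REWRITE_RULES = [
    ("vn", "un"),   # e.g. "vnd"  -> "und"
    ("th", "t"),    # e.g. "thun" -> "tun"
    ("ey", "ei"),   # e.g. "seyn" -> "sein"
]

MODERN_TAGGER_LEXICON = {  # toy stand-in for a modern-German tagger
    "und": "KON",
    "sein": "VAINF",
    "tun": "VVINF",
}

def normalize(token: str) -> str:
    """Apply character rewrite rules to map a historical spelling
    to (an approximation of) its modern form."""
    for old, new in REWRITE_RULES:
        token = token.replace(old, new)
    return token

def tag(tokens):
    """Normalize each token, then tag it via the modern lexicon."""
    return [(t, MODERN_TAGGER_LEXICON.get(normalize(t), "UNK"))
            for t in tokens]

print(tag(["vnd", "seyn", "thun"]))  # prints [('vnd', 'KON'), ('seyn', 'VAINF'), ('thun', 'VVINF')]
```

In the paper itself the normalization component is learned from a small amount of manually normalized data (250 tokens in the 15th-century experiment); the lookup tagger here merely stands in for a full statistical tagger.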
Automatic Normalization for Linguistic Annotation of Historical Language Data
This paper deals with spelling normalization of historical texts with regard to further processing with modern part-of-speech taggers. Different methods for this task are presented and evaluated on a
Automatic Phrase Recognition in Historical German
Due to a lack of annotated data, theories of historical syntax are often based on very small, manually compiled data sets. To enable the empirical evaluation of existing hypotheses, the present study
Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts
This paper shows that high-quality part-of-speech tagging and lemmatization of historical texts is possible while operating directly on the historical spelling, and achieves state-of-the-art results for modern German morphological tagging on the TIGER corpus as well as on two historical corpora that have been used in previous work.
Analysis of Part-Of-Speech Tagging of Historical German Texts
This adaptable approach, which trains taggers on a target language variety to improve the accuracy of part-of-speech tagging (hereafter POS-tagging) of historical German corpora, provides reliable data allowing taggers to be used for the analysis of different historical texts.
Variations on the theme of variation: Dealing with spelling variation for fine-grained POS tagging of historical texts
This paper investigates different ways of dealing with spelling variation on several variants of historical German, and recommends rule-based simplification and substitution of spelling variants for low-resourced settings.
Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation
An unsupervised, rule-based approach is presented which maps historical wordforms to modern wordforms via context-aware rewrite rules over sequences of characters, derived from two aligned versions of the Luther bible.
A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text
The evaluation of approaches for spelling normalisation of historical text based on data from five languages shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
Part-Of-Speech in Historical Corpora: Tagger Evaluation and Ensemble Systems on ARCHER
Tagger accuracy improves when using a version of the corpus that has been automatically mapped to PDE spelling with VARD, and when combining several part-of-speech taggers in an ensemble system, which improves tagging by about 1% over CLAWS and 2% over TreeTagger.
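Combining several taggers in an ensemble, as evaluated in the entry above, is often done as a simple per-token majority vote. A minimal sketch, where the tag sequences are made-up examples rather than ARCHER output:

```python
# Hedged sketch of a majority-vote POS-tagger ensemble.
# Each inner list is one tagger's output, aligned token by token.
from collections import Counter

def ensemble_tag(taggings):
    """Combine per-token tags from several taggers by majority vote.

    `taggings` is a list of tag sequences, one per tagger; ties are
    broken by whichever tag was seen first (Counter insertion order).
    """
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*taggings)]

votes = [
    ["NN", "VB", "DT"],   # tagger 1
    ["NN", "NN", "DT"],   # tagger 2
    ["JJ", "VB", "DT"],   # tagger 3
]
print(ensemble_tag(votes))  # prints ['NN', 'VB', 'DT']
```

Real ensemble systems often weight the vote by per-tagger confidence or accuracy; the unweighted vote above is only the simplest variant.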
Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction
A large number of historical texts are not available in an electronic format, and even if they are, they are unlikely to be suitable for use in an e-book format.
How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts
This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy.

References

POS-Tagging of Historical Language Data: First Experiments
As expected, tagging with “normalized”, quasi-standardized tokens performs best (accuracy > 91%).
A Gold Standard Corpus of Early Modern German
An annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants is described, providing an example of the requirements and needs of smaller humanities-based corpus projects.
Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora
Evaluating the accuracy of existing POS taggers, trained on modern English, when applied to Early Modern English (EModE) datasets highlights the extent to which the handling of orthographic variants suffices for tagging accuracy on EModE data to approach the levels attained on modern-day texts.
(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool
Historical texts typically show a high degree of variance in spelling. Normalization of variant word forms to their modern spellings can greatly benefit further processing of the data, e.g., POS tagging.
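Distance-based normalization of the kind referred to in this entry can be sketched as picking, for each historical wordform, the modern-lexicon entry with the smallest edit distance. The wordlist below is a toy stand-in, not Norma's actual lexicon or distance measure:

```python
# Hedged sketch: normalize a historical spelling by choosing the
# modern word with the smallest Levenshtein (edit) distance.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

MODERN_LEXICON = ["und", "sein", "tun", "jahr"]  # toy modern wordlist

def normalize(word: str) -> str:
    """Return the modern-lexicon entry closest to the historical form."""
    return min(MODERN_LEXICON, key=lambda w: levenshtein(word, w))

print(normalize("vnd"))  # prints und
```

Tools like Norma combine such a distance module with lexicon lookup and rewrite rules, and weight the edit operations rather than treating them all as cost 1 as this sketch does.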
VARD2 : a tool for dealing with spelling variation in historical corpora
The VARD tool is presented, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants, with particular focus on Early Modern English corpora.
Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
Norma is presented, a semi-automatic normalization tool that integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries, given new input.
Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text
This study assesses the effects of spelling variation on the performance of the tagger, and investigates to what extent tagger performance can be improved by using 'normalised' input, where spelling variants in the corpus are standardised to a modern form.
Estimation of Conditional Probabilities With Decision Trees and an Application to Fine-Grained POS Tagging
An HMM part-of-speech tagging method particularly suited for POS tagsets with a large number of fine-grained tags, based on splitting the POS tags into attribute vectors and decomposing the contextual POS probabilities of the HMM into a product of attribute probabilities.
Annotation and Representation of a Diachronic Corpus of Spanish
This article describes two strategies for the automatic tagging of a Spanish diachronic corpus: adapting existing NLP tools developed for modern Spanish, and a new approach that, rather than adapting the source texts to the tagger, modifies the tagger for the direct treatment of the old variants.
From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation
This work adapted VARD2 (Baron and Rayson, 2008), a statistical tool for normalising spelling, for use with the Portuguese language and studied its performance over four time periods, showing that VARD2 performed best on the older letters and worst on the most modern ones.