• Corpus ID: 11249173

POS Tagging for Historical Texts with Sparse Training Data

Marcel Bollmann
This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to preprocess the texts before applying a POS tagger trained on modern German corpora. Using only 250 manually normalized tokens as training data, the tagging accuracy for a manuscript from the 15th century can be raised from 28.65% to 74.89%.
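
The normalize-then-tag idea described above can be sketched roughly as follows. This is only an illustration: the abstract does not say which normalization method the paper uses, so a plain wordlist substitution learned from (historical, modern) token pairs is assumed here, and `learn_wordlist`, `normalize`, `tag_historical`, `TOY_LEXICON`, and `toy_tagger` are hypothetical names. The toy tagger merely stands in for a POS tagger trained on modern German.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def learn_wordlist(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each historical form to its most frequent modern normalization."""
    counts: Dict[str, Counter] = {}
    for historical, modern in pairs:
        counts.setdefault(historical.lower(), Counter())[modern] += 1
    return {h: c.most_common(1)[0][0] for h, c in counts.items()}


def normalize(tokens: List[str], wordlist: Dict[str, str]) -> List[str]:
    """Replace known historical forms; leave unknown tokens unchanged."""
    return [wordlist.get(tok.lower(), tok) for tok in tokens]


def tag_historical(
    tokens: List[str],
    wordlist: Dict[str, str],
    tagger: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Tag the normalized tokens, but report tags against the original tokens."""
    tags = tagger(normalize(tokens, wordlist))
    return list(zip(tokens, tags))


# Toy stand-in for a tagger trained on modern German (assumption, not a real model).
TOY_LEXICON = {"und": "KON", "in": "APPR", "Frauen": "NN"}


def toy_tagger(tokens: List[str]) -> List[str]:
    return [TOY_LEXICON.get(tok, "UNK") for tok in tokens]
```

In this sketch the manually normalized tokens play the role of `pairs`; any token not covered by the learned wordlist is passed to the tagger in its historical spelling.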

Automatic Normalization for Linguistic Annotation of Historical Language Data

Different methods for spelling normalization of historical texts with regard to further processing with modern part-of-speech taggers are presented and evaluated and a chain combination using word-based and character-based techniques is shown to be best for normalization.

Automatic Phrase Recognition in Historical German

The evaluation shows that the unlexicalized parser outperforms the sequence labeling approach, achieving F1-scores of 87%–91% on modern German and between 73% and 85% on different historical corpora.

Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts

This paper shows that high-quality part-of-speech tagging and lemmatization of historical texts is possible while operating directly on the historical spelling, and achieves state-of-the-art results for modern German morphological tagging on the Tiger corpus and also on two historical corpora which have been used in previous work.

Analysis of Part-Of-Speech Tagging of Historical German Texts

This adaptable approach trains taggers on a target language variety to improve the accuracy of part-of-speech tagging (hereafter POS tagging) of historical German corpora, providing reliable data that allows the taggers to be used for the analysis of different historical texts.

Variations on the theme of variation: Dealing with spelling variation for fine-grained POS tagging of historical texts

This paper investigates different ways of dealing with spelling variation in such a situation on different variants of historical German that contain spelling variation, and recommends rule-based simplification and substitution of spelling variants for low-resourced settings.

Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation

An unsupervised, rule-based approach is presented which maps historical wordforms to modern wordforms in the form of context-aware rewrite rules that apply to sequences of characters derived from two aligned versions of the Luther Bible.

A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text

The evaluation of approaches for spelling normalisation of historical text based on data from five languages shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.

Part-Of-Speech in Historical Corpora: Tagger Evaluation and Ensemble Systems on ARCHER

Tagger accuracy improves by using a version of the corpus that has been automatically mapped to PDE spelling with VARD, and by combining several part-of-speech taggers in an ensemble system – which improves tagging by about 1% over CLAWS and 2% over Tree-Tagger.
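
As a rough illustration of combining several taggers, the sketch below uses per-token majority voting. The summary above does not state how the ensemble is actually built, so majority voting, the tie-breaking rule, and the function name `majority_vote` are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine tag sequences from several taggers by per-token majority vote.

    `predictions[i][j]` is the tag that tagger i assigns to token j.
    Ties are broken in favour of the tagger listed first (an assumption).
    """
    if not predictions:
        raise ValueError("need at least one tagger output")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag (in tagger order) that reaches the top count.
        voted.append(next(tag for tag in token_tags if counts[tag] == best))
    return voted
```

Ordering the taggers from most to least trusted makes the tie-breaking rule prefer the stronger system.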

Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction

A large number of historical texts are not available in an electronic format, and even if they are, they are unlikely to be suitable for use in an e-book format.

How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts

This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy.



POS-Tagging of Historical Language Data: First Experiments

As expected, tagging with “normalized”, quasi-standardized tokens performs best (accuracy > 91%).

A Gold Standard Corpus of Early Modern German

An annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants is described, providing an example of the requirements and needs of smaller humanities-based corpus projects.

Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora

Evaluating the accuracy of existing POS taggers, trained on modern English, when they are applied to Early Modern English (EModE) datasets highlights the extent to which the handling of orthographic variants is sufficient for the tagging accuracy of EModE data to approximate to the levels attained on modern-day text(s).

(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool

This paper compares several approaches to normalization with a focus on methods based on string distance measures and evaluates them on two different types of historical texts, showing that a combination of normalization methods produces the best results.
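
A distance-based normalizer of the general kind mentioned above might look like the following sketch. It assumes plain, unweighted Levenshtein distance and a fixed threshold; the distance measures actually compared in the paper are not given in this summary, and `levenshtein`, `normalize_by_distance`, and `max_distance` are illustrative names.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalize_by_distance(word: str, lexicon: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest modern lexicon entry, or the word itself if none is close enough."""
    # Ties on distance are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda m: (levenshtein(word, m), m), default=None)
    if best is None or levenshtein(word, best) > max_distance:
        return word
    return best
```

The threshold keeps words with no plausible modern counterpart in the lexicon from being mapped to an unrelated entry.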

VARD2 : a tool for dealing with spelling variation in historical corpora

The VARD tool is presented, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants, with particular focus on Early Modern English corpora.

Manual and semi-automatic normalization of historical spelling - case studies from Early New High German

Norma is presented, a semi-automatic normalization tool that integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries, given new input.

Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text

This study assesses the effects of spelling variation on the performance of the tagger, and investigates to what extent tagger performance can be improved by using 'normalised' input, where spelling variants in the corpus are standardised to a modern form.

Estimation of Conditional Probabilities With Decision Trees and an Application to Fine-Grained POS Tagging

An HMM part-of-speech tagging method is presented which is particularly suited for POS tagsets with a large number of fine-grained tags, based on splitting of the POS tags into attribute vectors and decomposition of the contextual POS probabilities of the HMM into a product of attribute probabilities.
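
One way to write such a decomposition is via the chain rule. The notation here is an assumption: a fine-grained tag $t$ is treated as a vector of $K$ attributes $(a_1, \dots, a_K)$, and $c$ stands for the tag context the HMM conditions on. The exact conditioning used in the paper is not given in this summary.

```latex
\[
  P(t \mid c) \;=\; P(a_1, \dots, a_K \mid c)
  \;=\; \prod_{k=1}^{K} P\bigl(a_k \mid a_1, \dots, a_{k-1},\, c\bigr)
\]
```

Each factor is a distribution over a single attribute's values, which is far smaller than the full fine-grained tagset; as the title indicates, these conditional probabilities are estimated with decision trees.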

Annotation and Representation of a Diachronic Corpus of Spanish

This article describes two different strategies for the automatic tagging of a Spanish diachronic corpus involving the adaptation of existing NLP tools developed for modern Spanish and proposes a new one, which does not consist in adapting the source texts to the taggers, but rather in modifying the tagger for the direct treatment of the old variants.

From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation

This work adapted VARD2 (Baron and Rayson, 2008), a statistical tool for normalising spelling, for use with the Portuguese language and studied its performance over four time periods, showing that VARD2 performed best on the older letters and worst on the most modern ones.