Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation

@inproceedings{Bollmann2011ApplyingRN,
  title={Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation},
  author={Marcel Bollmann and Florian Petran and Stefanie Dipper},
  booktitle={LTC},
  year={2011}
}

This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. The rules take the form of context-aware rewrite rules that apply to sequences of characters; they are derived from two aligned versions of the Luther Bible and weighted according to their frequency. Applying the normalization rules to texts by Luther results in 91% exact matches, clearly outperforming the…
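
To make the rule mechanism concrete, here is a minimal Python sketch of weighted, context-aware character rewrite rules of the kind the abstract describes. The rule format (left context, source, right context, replacement, weight), the example rules, and the greedy left-to-right application strategy are illustrative assumptions for exposition; the paper derives its actual rules and weights from the aligned Luther Bible versions.

# Illustrative rules only -- not taken from the paper. Each rule is
# (left context, source, right context, replacement, weight), where the
# weight stands in for the rule frequency learned from the aligned
# Luther Bible versions.
RULES = [
    ("",  "v",  "n", "u", 0.9),  # e.g. "vnd"  -> "und"
    ("",  "th", "",  "t", 0.7),  # e.g. "thun" -> "tun"
    ("e", "y",  "",  "i", 0.6),  # e.g. "seyn" -> "sein"
]

def normalize(word):
    """Greedily apply the highest-weighted matching rule at each position."""
    out = []
    i = 0
    while i < len(word):
        best = None
        for left, src, right, tgt, weight in RULES:
            # A rule matches if its source starts at position i and its
            # left/right contexts surround that occurrence.
            if (word.startswith(src, i)
                    and word[:i].endswith(left)
                    and word.startswith(right, i + len(src))):
                if best is None or weight > best[0]:
                    best = (weight, src, tgt)
        if best is not None:
            _, src, tgt = best
            out.append(tgt)
            i += len(src)
        else:
            out.append(word[i])  # no rule matches: copy the character
            i += 1
    return "".join(out)

for historical in ["vnd", "thun", "seyn"]:
    print(historical, "->", normalize(historical))  # und, tun, sein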

Citations

Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
Norma, a semi-automatic normalization tool, is presented; it integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries given new input.
(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool
This paper compares several approaches to normalization with a focus on methods based on string distance measures and evaluates them on two different types of historical texts, showing that a combination of normalization methods produces the best results.
Using Comparable Collections of Historical Texts for Building a Diachronic Dictionary for Spelling Normalization
In this paper, we argue that comparable collections of historical written resources can help overcome typical challenges posed by heritage texts, enhancing spelling normalization, POS tagging and…
Unsupervised regularization of historical texts for POS tagging
This paper presents an unsupervised method to reduce spelling variation in historical texts in order to mitigate the problem of data sparsity and evaluates the usefulness of this approach using POS tagging.
Token-based spelling variant detection in Middle Low German texts
A pipeline for detecting spelling variants, i.e., different spellings that represent the same word, in non-standard texts is presented; it can be used to improve the performance of natural language processing tools on the data by reducing the number of unknown words.
Normalizing historical orthography for OCR historical documents using LSTM
This paper proposes a new technique to model the target modern language by means of a recurrent neural network with a long short-term memory architecture, and shows that the proposed LSTM model performs well at normalizing historical wordforms to modern wordforms.
Automatic Normalisation of Historical Text
This thesis evaluates three models: a Hidden Markov Model, which has not previously been used for historical text normalisation, and a soft attention neural network model, which achieves state-of-the-art normalisation accuracy in all datasets, even when the volume of training data is restricted.
The Anselm Corpus: Methods and Perspectives of a Parallel Aligned Corpus
This paper presents ongoing work in the Anselm project at Ruhr-University Bochum, which deals with a parallel corpus of historical language data. We first present our corpus, which consists of about…
Evaluating Historical Text Normalization Systems: How Well Do They Generalize?
It is shown that the neural models generalize well to unseen words in tests on five languages; nevertheless, they provide no clear benefit over the naïve baseline for downstream POS tagging of an English historical collection.
Proceedings of the 13th Conference on Natural Language Processing, KONVENS 2016, Bochum, Germany, September 19-21, 2016
This paper provides concrete examples and discussion of potential pitfalls in working with non-automated analyses of “non-standard” data using CL methods, noting that surprises can be in store even in well-studied data sets.

References

Showing 1-10 of 21 references
Rule-Based Normalization of Historical Texts
An unsupervised, rule-based approach which maps historical wordforms to modern wordforms through context-aware rewrite rules that apply to sequences of characters and are derived from two aligned versions of the Luther Bible.
Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
Norma, a semi-automatic normalization tool, is presented; it integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way and dynamically updates the set of rule entries given new input.
(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool
This paper compares several approaches to normalization with a focus on methods based on string distance measures and evaluates them on two different types of historical texts, showing that a combination of normalization methods produces the best results.
An SMT Approach to Automatic Annotation of Historical Text
This paper proposes an approach to tagging and parsing of historical text, using character-based SMT methods to translate the historical spelling to a modern spelling before applying the NLP tools, and shows that this approach to spelling normalisation is successful even with small amounts of training data and is generalisable to several languages.
Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations
The algorithm, originally used for spell checking, is adapted to the problem of retrieving historical words from queries in modern spelling; it uses stochastic edit weights learned from training pairs of modern and historical spellings.
More than Words: Using Token Context to Improve Canonicalization of Historical German
  Bryan Jurish. J. Lang. Technol. Comput. Linguistics, 2010
Type-wise canonicalization techniques, which process each input word independently, and token-wise techniques, which make use of the context in which a given instance of a word occurs, are presented.
POS Tagging for Historical Texts with Sparse Training Data
This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to…
Generating Search Term Variants for Text Collections with Historic Spellings
A new algorithm generates search term variants in ancient orthography by applying a spell checker to a corpus of historic texts, producing a set of probabilistic rules that can be used for ranking in the retrieval stage.
Natural Language Processing for Historical Texts
  M. Piotrowski. Synthesis Lectures on Human Language Technologies, 2012
This book aims to give an introduction to NLP for historical texts and an overview of the state of the art in this field, including specific methods, such as creating part-of-speech taggers for historical languages or handling spelling variation.
Modernizing historical Slovene words with character-based SMT
We propose a language-independent word normalization method exemplified on modernizing historical Slovene words. Our method relies on character-based statistical machine translation and uses only…