• Corpus ID: 14467927

Rule-Based Normalization of Historical Texts

Marcel Bollmann, Florian Petran, Stefanie Dipper
This paper deals with the normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther bible and weighted according to their frequency. The evaluation shows that our approach (83%–91% exact matches) clearly outperforms the baseline (65%).
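The idea of frequency-weighted, context-aware character rewrite rules can be illustrated with a minimal sketch. It is not the authors' implementation: the alignment uses Python's `difflib.SequenceMatcher` as a stand-in for whatever alignment the paper uses, the rule shape (one character of historical context on each side), the `#` boundary symbol, the longest-match-first application order, and the toy word pairs are all assumptions made for illustration, and the pairs are not taken from the Luther bible data.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

BOUNDARY = "#"  # assumed word-boundary symbol used as context at word edges


def align(hist, modern):
    """Split a non-empty historical word into segments, each paired with modern text."""
    segments = []  # each entry: [start, end, target] over indices of hist
    prefix = ""    # inserted modern text seen before any segment exists
    ops = SequenceMatcher(None, hist, modern, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in ops:
        if tag == "insert":
            # Attach inserted characters to the preceding segment when there is one.
            if segments:
                segments[-1][2] += modern[j1:j2]
            else:
                prefix = modern[j1:j2]
            continue
        if tag == "equal":
            new = [[k, k + 1, hist[k]] for k in range(i1, i2)]  # identity rules
        else:  # "replace" or "delete"
            new = [[i1, i2, modern[j1:j2]]]
        new[0][2] = prefix + new[0][2]
        prefix = ""
        segments.extend(new)
    return segments


def learn_rules(pairs):
    """Count rules keyed by (left context, source, right context) -> target."""
    rules = defaultdict(Counter)
    for hist, modern in pairs:
        padded = BOUNDARY + hist + BOUNDARY
        for start, end, target in align(hist, modern):
            left, right = padded[start], padded[end + 1]
            rules[(left, hist[start:end], right)][target] += 1
    return rules


def normalize(word, rules, max_len=3):
    """Rewrite left to right, preferring longer sources and more frequent targets."""
    padded = BOUNDARY + word + BOUNDARY
    out, i = [], 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            key = (padded[i], word[i:i + n], padded[i + n + 1])
            if key in rules:
                out.append(rules[key].most_common(1)[0][0])
                i += n
                break
        else:
            out.append(word[i])  # no rule matches: copy the character unchanged
            i += 1
    return "".join(out)


# Toy (historical, modern) pairs, invented for illustration only.
PAIRS = [("vnd", "und"), ("vns", "uns"), ("vnter", "unter"),
         ("jar", "jahr"), ("jn", "in"), ("jn", "in"), ("jn", "ihn")]
rules = learn_rules(PAIRS)
print(normalize("vnser", rules))  # unser (unseen word, rules generalize via context)
print(normalize("jn", rules))     # in (the more frequent target wins over "ihn")
```

Keeping identity rules alongside rewriting rules means an unchanged character competes on frequency with its rewritten variants in the same context, which is one simple way to read "weighted according to their frequency".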


Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation
An unsupervised, rule-based approach maps historical wordforms to modern wordforms using context-aware rewrite rules that apply to sequences of characters and are derived from two aligned versions of the Luther bible.
Automatic Normalization for Linguistic Annotation of Historical Language Data
Different methods for spelling normalization of historical texts are presented and evaluated with regard to further processing with modern part-of-speech taggers, and a chain combination of word-based and character-based techniques is shown to be best for normalization.
A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text
The evaluation of approaches for spelling normalisation of historical text based on data from five languages shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
Rule-based normalisation of historical text - A diachronic study
The impact of a set of hand-crafted normalisation rules on Swedish texts ranging from 1527 to 1812 is explored, showing that spelling correction is a useful strategy for applying contemporary NLP tools to historical text.
Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
This paper presents a Levenshtein-based approach to normalisation of historical text to a modern spelling, and shows that this method is successful both in terms of normalisation accuracy, and by the performance of a standard modern tagger applied to the historical text.
Normalizing Medieval German Texts: from rules to deep learning
This comparative evaluation tests the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation.
Automatic Identification of Spelling Variation in Historical Texts
Languages in earlier stages of development differ from their modern analogues, reflecting syntactic, semantic and morphological changes over time. The study of these and other phenomena is the major
Investigating Diatopic Variation in a Historical Corpus
This paper investigates diatopic variation in a historical corpus of German by derived replacement rules and mappings which describe the relations between word forms and shows that this approach can replicate results from historical linguistics.
Spelling Normalization of Historical German with Sparse Training Data
This paper presents an approach to spelling normalization that combines three different normalization algorithms and evaluates it on a diverse set of texts of historical German, showing that this approach produces acceptable results even with comparatively small amounts of training data.
An SMT Approach to Automatic Annotation of Historical Text
This paper proposes an approach to tagging and parsing of historical text, using character-based SMT methods for translating the historical spelling to a modern spelling before applying the NLP tools, and shows that this approach to spelling normalisation is successful even with small amounts of training data and is generalisable to several languages.


Generating Search Term Variants for Text Collections with Historic Spellings
A new algorithm for generating search term variants in ancient orthography by applying a spell checker on a corpus of historic texts, which produces a set of probabilistic rules that can be considered for ranking in the retrieval stage.
More than Words: Using Token Context to Improve Canonicalization of Historical German
Bryan Jurish. J. Lang. Technol. Comput. Linguistics, 2010.
Type-wise canonicalization techniques, which process each input word independently, and token-wise techniques, which make use of the context in which a given instance of a word occurs, are presented.
Europarl: A Parallel Corpus for Statistical Machine Translation
A corpus of parallel text in 11 languages is collected from the proceedings of the European Parliament; the paper focuses on its acquisition and its application as training data for statistical machine translation (SMT).
Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations
An algorithm used for spell checking is adapted to the problem of retrieving historical words with queries in modern spelling; it uses stochastic weights learned from training pairs of modern and historical spellings.
Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora
This work develops a novel, fast approach that achieves high accuracy in terms of F1 for the alignment of both asymmetrical and symmetrical parallel corpora.
A Systematic Comparison of Various Statistical Alignment Models
An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
Predicting intelligibility and perceived linguistic distance by means of the Levenshtein algorithm
Automatic Standardization of Spelling for Historical Text Mining
Information retrieval for languages that lack a fixed orthography
Seminar paper, 2003.
Die Rolle Luthers für die deutsche Sprachgeschichte