Corpus ID: 211551725

(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool

@inproceedings{Bollmann2012SemiAutomaticNO,
  title={(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool},
  author={Marcel Bollmann},
  year={2012}
}
Historical texts typically show a high degree of variance in spelling. Normalization of variant word forms to their modern spellings can greatly benefit further processing of the data, e.g., POS tagging or lemmatization. This paper compares several approaches to normalization with a focus on methods based on string distance measures and evaluates them on two different types of historical texts. Furthermore, the Norma tool is introduced, an interactive normalization tool which is flexibly… 
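To make the distance-based idea concrete, here is a minimal Python sketch (not the paper's or the Norma tool's implementation): a historical word form is normalized by choosing the modern lexicon entry with the smallest plain Levenshtein distance. The toy lexicon and the Early New High German-style spellings are invented for illustration.

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalize(historical: str, lexicon: list[str]) -> str:
    """Return the modern lexicon entry closest to the historical spelling."""
    return min(lexicon, key=lambda modern: levenshtein(historical, modern))

# Hypothetical historical spellings against a toy modern German lexicon.
modern_lexicon = ["und", "jahr", "tag", "wasser", "herz"]
for old_form in ["vnnd", "jar", "hertz"]:
    print(old_form, "->", normalize(old_form, modern_lexicon))

In practice the lexicon is large and ties are common, so such a baseline is usually combined with frequency information or learned edit weights rather than used on its own.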

Citations

Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation
TLDR
An unsupervised, rule-based approach is presented that maps historical wordforms to modern wordforms using context-aware rewrite rules over character sequences, derived from two aligned versions of the Luther Bible.
Automatic Normalization for Linguistic Annotation of Historical Language Data
TLDR
Different methods for spelling normalization of historical texts, with regard to further processing with modern part-of-speech taggers, are presented and evaluated, and a chain combination of word-based and character-based techniques is shown to perform best for normalization.
Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
TLDR
This paper presents a Levenshtein-based approach to normalising historical text to modern spelling and shows that the method is successful both in terms of normalisation accuracy and in the performance of a standard modern tagger applied to the historical text (a weighted-distance variant is sketched after this list).
Automatic Identification of Spelling Variation in Historical Texts
Languages in earlier stages of development differ from their modern analogues, reflecting syntactic, semantic and morphological changes over time. The study of these and other phenomena is the major…
POS Tagging for Historical Texts with Sparse Training Data
This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to…
A Large-Scale Comparison of Historical Text Normalization Systems
TLDR
This paper presents the largest study of historical text normalization done so far, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods.
Spelling Normalization of Historical German with Sparse Training Data
TLDR
This paper presents an approach to spelling normalization that combines three different normalization algorithms and evaluates it on a diverse set of texts of historical German, showing that this approach produces acceptable results even with comparatively small amounts of training data.
Techniques for Automatic Normalization of Orthographically Variant Yiddish Texts
TLDR
Using a manually normalized set of 16 Yiddish documents as a training and test corpus, four techniques for automatic normalization were compared: a hand-crafted set of transformation rules, an off-the-shelf spell checker, edit distance minimization with manually set weights, and edit distance minimization with weights learned from a training set.
Dealing with word-internal modification and spelling variation in data-driven lemmatization
TLDR
In an oracle setting, the proposed methods for generating lemma candidates are shown to yield a possible increase in lemmatization accuracy of 14% on Middle Low German, a group of historical German dialects (1200–1650 AD).
Token-based spelling variant detection in Middle Low German texts
TLDR
A pipeline is presented for detecting spelling variants, i.e., different spellings that represent the same word, in non-standard texts; it can be used to improve the performance of natural language processing tools on the data by reducing the number of unknown words.
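Several of the works above (and in the references below) weight the Levenshtein distance so that frequent historical/modern character correspondences cost less than arbitrary edits, with the weights typically learned from aligned training pairs. The following Python sketch is purely illustrative: the substitution-weight table is hand-set and invented for the example, not taken from any of the cited systems.

# Hypothetical costs for frequent correspondences in historical German spelling.
SUB_WEIGHTS = {("v", "u"): 0.2, ("u", "v"): 0.2,
               ("y", "i"): 0.3, ("i", "y"): 0.3,
               ("f", "v"): 0.4, ("v", "f"): 0.4}

def weighted_levenshtein(a: str, b: str) -> float:
    """Edit distance with per-character-pair substitution weights."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else SUB_WEIGHTS.get((ca, cb), 1.0)
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub))   # (weighted) substitution
        prev = curr
    return prev[-1]

# "vnd" is now much closer to "und" than to an unrelated candidate.
print(weighted_levenshtein("vnd", "und"))   # 0.2
print(weighted_levenshtein("vnd", "ende"))  # 2.0

In the learned-weight setting, the cheap substitutions are not listed by hand but estimated from pairs of historical and modern spellings, e.g. by counting character alignments or by stochastic edit distance training.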

References

SHOWING 1-10 OF 21 REFERENCES
Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation
TLDR
An unsupervised, rule-based approach is presented that maps historical wordforms to modern wordforms using context-aware rewrite rules over character sequences, derived from two aligned versions of the Luther Bible.
VARD2 : a tool for dealing with spelling variation in historical corpora
TLDR
The VARD tool is presented, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants, with particular focus on Early Modern English corpora.
From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation
TLDR
This work adapted VARD2 (Baron and Rayson, 2008), a statistical tool for normalising spelling, for use with Portuguese and studied its performance over four time periods, showing that VARD2 performed best on the older letters and worst on the most modern ones.
Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations
TLDR
An algorithm originally used for spell checking is adapted to the retrieval of historical word forms from queries in modern spelling; it uses stochastic edit distance weights learned from training pairs of modern and historical spellings.
Generating Search Term Variants for Text Collections with Historic Spellings
TLDR
A new algorithm generates search term variants in ancient orthography by applying a spell checker to a corpus of historic texts, producing a set of probabilistic rules that can be used for ranking in the retrieval stage.
More than Words: Using Token Context to Improve Canonicalization of Historical German
  • Bryan Jurish, J. Lang. Technol. Comput. Linguistics, 2010
TLDR
Both canonicalization techniques that process each input word independently and token-wise techniques that make use of the context in which a given instance of a word occurs are presented.
Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text
TLDR
This study assesses the effects of spelling variation on the performance of the tagger, and investigates to what extent tagger performance can be improved by using 'normalised' input, where spelling variants in the corpus are standardised to a modern form.
bokstaffua, bokstaffwa, bokstafwa, bokstaua, bokstawa ... Towards lexical link-up for a corpus of Old Swedish
TLDR
Ongoing work on handling spelling variation in Old Swedish texts, which lack a standardized orthography, is presented; manually compiled substitution rules are compared with rules automatically derived from spelling variants in a lexicon.
Finding approximate matches in large lexicons
TLDR
This paper shows how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon, proposes methods for combining these techniques, and shows experimentally that the combinations yield good retrieval effectiveness while keeping index size and retrieval time low.
TnT - A Statistical Part-of-Speech Tagger
TLDR
Contrary to claims found elsewhere in the literature, it is argued that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework.