A Large-Scale Comparison of Historical Text Normalization Systems

@article{Bollmann2019ALC,
  title={A Large-Scale Comparison of Historical Text Normalization Systems},
  author={Marcel Bollmann},
  journal={ArXiv},
  year={2019},
  volume={abs/1904.02036}
}

There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder–decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature…

Citations

Few-Shot and Zero-Shot Learning for Historical Text Normalization
TLDR
This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks, and shows that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
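The identity baseline mentioned in this summary simply leaves every historical form unchanged. As a minimal sketch of how its word-level accuracy is computed (the toy word pairs are invented for illustration, not taken from the paper):

# Identity baseline for historical text normalization: predict each
# historical token unchanged and measure word-level accuracy.
# The (historical, modern) toy pairs are illustrative assumptions.

def identity_accuracy(pairs):
    return sum(1 for hist, modern in pairs if hist == modern) / len(pairs)

pairs = [("vnd", "und"), ("jahr", "jahr"), ("seyn", "sein"), ("tag", "tag")]
print(identity_accuracy(pairs))  # 0.5 on this toy sample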
Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches
TLDR
This work explores different setups for obtaining normalised text from medieval Latin manuscripts, either by training HTR engines on normalised (i.e., expanded, disabbreviated) text or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation, and normalisation.
Context-Aware Text Normalisation for Historical Dialects
TLDR
A multi-dialect normaliser with context-aware reranking of candidates, based on a word-level n-gram language model applied to the five best normalisation candidates; the evaluation shows that incorporating dialectal information into training leads to an accuracy improvement on all the datasets.
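As a rough sketch of this kind of context-aware reranking (not the paper's actual system), the snippet below rescores a normaliser's n-best candidates with a word-level bigram language model; the candidate scores, bigram probabilities, and interpolation weight are all invented for illustration.

# Rerank n-best normalisation candidates with a word-level bigram LM.
# Probabilities, candidate scores, and the weight are illustrative assumptions.
import math

BIGRAM_LOGPROB = {("the", "year"): math.log(0.02)}
FLOOR = math.log(1e-9)  # back-off score for unseen bigrams

def rerank(prev_word, candidates, lm_weight=0.5):
    # candidates: list of (normalised form, log-probability from the base normaliser)
    def score(item):
        form, channel = item
        lm = BIGRAM_LOGPROB.get((prev_word, form), FLOOR)
        return (1 - lm_weight) * channel + lm_weight * lm
    return max(candidates, key=score)

five_best = [("yeare", math.log(0.4)), ("year", math.log(0.35)), ("yer", math.log(0.1))]
print(rerank("the", five_best))  # the context model prefers 'year' over the top-ranked 'yeare'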
Summarising Historical Text in Modern Languages
TLDR
This work reports automatic and human evaluations that distinguish the historical-to-modern language summarisation task from standard cross-lingual summarisation, highlights the distinctness and value of the dataset, and demonstrates that the transfer-learning approach outperforms standard cross-lingual benchmarks on this task.
Semi-supervised Contextual Historical Text Normalization
TLDR
By utilizing a simple generative normalization model and obtaining powerful contextualization from the target-side language model, this work trains accurate models from unlabeled historical data, reaching the same accuracy levels as training on labeled data.
Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers
TLDR
An overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English; its objectives are to strengthen the robustness of existing approaches on non-standard inputs, to enable performance comparison of NE processing on historical texts, and to foster efficient semantic indexing of historical documents.
Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers
TLDR
The CLEF 2020 Evaluation Lab HIPE (Identifying Historical People, Places and other Entities) on named entity recognition and linking on diachronic historical newspaper material in French, German and English is introduced.
Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers
This paper presents an extended overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers.
Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition
TLDR
It is found that using pre-trained language models helps with NER but less so with post-OCR correction, and that pre-trained language models should be used critically when working with OCRed historical corpora.
Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization
TLDR
This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization; the system is based on a BERT token classification preprocessing step followed by a character-level SMT step in which the text is translated from original to normalized form given the BERT-predicted transformation constraints.

References

Showing 1–10 of 63 references
(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool
TLDR
This paper compares several approaches to normalization with a focus on methods based on string distance measures and evaluates them on two different types of historical texts, showing that a combination of normalization methods produces the best results.
Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
TLDR
This paper presents a Levenshtein-based approach to normalising historical text to a modern spelling, and shows that this method is successful both in terms of normalisation accuracy and in terms of the performance of a standard modern tagger applied to the historical text.
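In a weighted Levenshtein distance, the edit costs depend on the characters involved, so frequent historical variant pairs (e.g. v/u, y/i) become cheap to align, and candidate normalisations can be ranked by their distance to entries of a modern lexicon. The sketch below illustrates the general idea only; the cost table and lexicon are invented, and the context-sensitive conditioning and compound splitting described in the paper are omitted.

# Weighted Levenshtein distance: character-dependent substitution costs,
# so common historical variant pairs (e.g. v/u, y/i) are cheap to align.
# The cost table and lexicon are illustrative assumptions.

CHEAP_SUBS = {("v", "u"): 0.1, ("u", "v"): 0.1, ("y", "i"): 0.2, ("i", "y"): 0.2}

def weighted_levenshtein(source, target, ins_cost=1.0, del_cost=1.0):
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else CHEAP_SUBS.get((a, b), 1.0)
            d[i][j] = min(d[i - 1][j] + del_cost,   # delete a source character
                          d[i][j - 1] + ins_cost,   # insert a target character
                          d[i - 1][j - 1] + sub)    # match or weighted substitution
    return d[m][n]

def normalise(word, lexicon):
    # Pick the modern lexicon entry closest to the historical word form.
    return min(lexicon, key=lambda entry: weighted_levenshtein(word, entry))

print(normalise("vnnd", ["und", "uns", "ende"]))  # -> 'und' under these example costs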
Multi-task learning for historical text normalization: Size matters
TLDR
The main finding—contrary to what has been observed for other NLP tasks—is that multi-task learning mainly works when target task data is very scarce.
An approach to unsupervised historical text normalisation
We present a novel approach to unsupervised noisy text correction. Our approach is based on automatic extraction of historical variation patterns by analysing the structure of the words from a…
A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text
TLDR
The evaluation of approaches for spelling normalisation of historical text based on data from five languages shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
Improving historical spelling normalization with bi-directional LSTMs and multi-task learning
TLDR
This work explores the suitability of a deep neural network architecture for historical document processing, in particular a deep bi-LSTM network applied at the character level, and shows that multi-task learning with additional normalization data can improve the model's performance further.
Spelling normalization of historical documents by using a machine translation approach
TLDR
Three approaches, based on statistical, neural, and character-based machine translation, are proposed to adapt a document's spelling to modern standards in order to create a digital text version of a historical document.
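When normalisation is cast as character-based machine translation, word pairs are typically rewritten as space-separated character sequences so that a standard MT toolkit can treat each character as a token. A minimal sketch of that preprocessing step (the word pairs and file names are illustrative assumptions, not the paper's data):

# Turn (historical, modern) word pairs into character-level parallel "sentences"
# for an off-the-shelf MT toolkit. Word pairs and file names are assumptions.
pairs = [("vnnd", "und"), ("seyn", "sein"), ("yeare", "year")]

with open("train.src", "w", encoding="utf-8") as src, \
     open("train.tgt", "w", encoding="utf-8") as tgt:
    for historical, modern in pairs:
        src.write(" ".join(historical) + "\n")  # e.g. "v n n d"
        tgt.write(" ".join(modern) + "\n")      # e.g. "u n d"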
Evaluating Historical Text Normalization Systems: How Well Do They Generalize?
TLDR
It is shown that the neural models generalize well to unseen words in tests on five languages; nevertheless, they provide no clear benefit over the naïve baseline for downstream POS tagging of an English historical collection.
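Generalization tests of this kind usually report accuracy separately for test tokens whose historical form was seen in the training data and for tokens that were not; a minimal sketch of that breakdown (the data layout is an assumption made for illustration):

# Accuracy broken down by whether the historical form occurred in training,
# a common way to measure how well a normalizer generalizes to unseen words.
# The data layout is an illustrative assumption.

def seen_unseen_accuracy(train_pairs, test_pairs, predictions):
    seen_forms = {hist for hist, _ in train_pairs}
    buckets = {"seen": [0, 0], "unseen": [0, 0]}   # [correct, total]
    for (hist, modern), pred in zip(test_pairs, predictions):
        key = "seen" if hist in seen_forms else "unseen"
        buckets[key][0] += int(pred == modern)
        buckets[key][1] += 1
    return {k: c / t if t else None for k, (c, t) in buckets.items()}

train = [("vnd", "und")]
test = [("vnd", "und"), ("seyn", "sein")]
print(seen_unseen_accuracy(train, test, ["und", "seyn"]))  # {'seen': 1.0, 'unseen': 0.0}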
Normalizing Medieval German Texts: from rules to deep learning
TLDR
This comparative evaluation tests the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation.
Rule-Based Normalization of Historical Texts
TLDR
An unsupervised, rule-based approach which maps historical wordforms to modern wordforms through context-aware rewrite rules that apply to sequences of characters derived from two aligned versions of the Luther bible.
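Character rewrite rules of this kind map sequences of historical characters to modern ones; the sketch below applies a few invented, unconditioned rules left to right and is only meant to illustrate the general mechanism, not the context-aware rules learned from the aligned Luther bible versions.

# Apply simple character rewrite rules to a historical word form.
# The rules below are invented examples of the general mechanism,
# not the context-aware rules learned in the cited paper.

RULES = [("vn", "un"), ("th", "t"), ("ey", "ei")]  # (historical sequence, modern sequence)

def apply_rules(word):
    out, i = [], 0
    while i < len(word):
        for old, new in RULES:
            if word.startswith(old, i):
                out.append(new)
                i += len(old)
                break
        else:
            out.append(word[i])
            i += 1
    return "".join(out)

print(apply_rules("vnther"))  # -> 'unter'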