Corpus ID: 215824896

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

@inproceedings{Bollmann2016ImprovingHS,
  title={Improving historical spelling normalization with bi-directional LSTMs and multi-task learning},
  author={Marcel Bollmann and Anders S{\o}gaard},
  booktitle={COLING},
  year={2016}
}
Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German…
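As a rough illustration of the architecture the abstract describes, the sketch below shows a character-level deep bi-LSTM that emits per-position logits over an output alphabet. This is not the authors' implementation; the vocabulary size and dimensions are made up.

```python
import torch
import torch.nn as nn

class CharBiLSTMNormalizer(nn.Module):
    """Sketch of a character-level deep bi-LSTM for spelling
    normalization: historical characters in, per-position logits out.
    All hyperparameters are illustrative, not the paper's."""

    def __init__(self, n_chars=60, emb_dim=64, hidden_dim=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        # Deep bi-directional LSTM over the character sequence.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers,
                              bidirectional=True, batch_first=True)
        # Map each position onto the output character vocabulary.
        self.out = nn.Linear(2 * hidden_dim, n_chars)

    def forward(self, char_ids):    # char_ids: (batch, seq_len)
        x = self.embed(char_ids)    # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)       # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)          # (batch, seq_len, n_chars)

# Toy usage: a batch of two "words" of five character ids each.
model = CharBiLSTMNormalizer()
print(model(torch.randint(0, 60, (2, 5))).shape)  # torch.Size([2, 5, 60])
```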

Citations

Learning attention for historical text normalization by learning to pronounce
TLDR
Interestingly, it is observed that, as previously conjectured, multi-task learning can learn to focus attention during decoding in ways remarkably similar to recently proposed attention mechanisms, an important step toward understanding how MTL works.
An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization
TLDR
The results show that NMT models are much better than SMT models in terms of character error rate, and that vanilla RNNs are competitive with GRUs/LSTMs in historical spelling normalization.
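Character error rate, the metric used in this comparison, is the edit distance between hypothesis and reference divided by the reference length. A minimal self-contained computation (the example strings are invented):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

print(cer("jar", "jahr"))  # 0.25: one missing character out of four
```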
Multi-task learning for historical text normalization: Size matters
TLDR
The main finding—contrary to what has been observed for other NLP tasks—is that multi-task learning mainly works when target task data is very scarce.
Normalizing Medieval German Texts: from rules to deep learning
TLDR
This comparative evaluation tests the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation.
Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents
TLDR
This work compares several character-based machine translation approaches and proposes a method that exploits modern documents to enrich neural machine translation models, improving their normalization quality.
Automatic Normalisation of Historical Text
TLDR
This thesis evaluates three models: a Hidden Markov Model, which has not previously been used for historical text normalisation; a soft attention neural network model, which achieves state-of-the-art normalisation accuracy on all datasets, even when the volume of training data is restricted.
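For reference, soft attention of the kind mentioned here usually follows the standard formulation: the decoder's context vector at output step $i$ is an alignment-weighted sum of the encoder states $h_j$, where $s_{i-1}$ is the previous decoder state and $a$ is a learned scoring function. This is the generic mechanism, not an excerpt from the thesis:

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
c_i = \sum_j \alpha_{ij} h_j
```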
Multi-Task Learning of Keyphrase Boundary Classification
TLDR
This work explores several auxiliary tasks, including semantic super-sense tagging and identification of multi-word expressions, and casts the KBC task as a multi-task learning problem with deep recurrent neural networks.
Spelling normalization of historical documents by using a machine translation approach
TLDR
Three approaches, based on statistical, neural and character-based machine translation, are proposed to adapt the document's spelling to modern standards in order to create a digital text version of a historical document.
Context-Aware Text Normalisation for Historical Dialects
TLDR
A multi-dialect normaliser with context-aware reranking of the candidates is presented; the reranker relies on a word-level n-gram language model applied to the five best normalisation candidates, and results show that incorporating dialectal information into the training leads to an accuracy improvement on all the datasets.
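Reranking of this kind reduces to combining each candidate's normalizer score with a language-model score for the candidate in its left context. The candidates, scores, and bigram log-probabilities below are invented for illustration:

```python
# Hypothetical 5-best output of a normalizer: (candidate, model log-score).
candidates = [("jahr", -0.4), ("jahre", -0.9), ("gar", -1.3),
              ("jähr", -1.6), ("ihr", -2.0)]

# Toy word-bigram log-probabilities; a real system would use an n-gram
# language model estimated on modern text.
bigram_logp = {("im", "jahr"): -1.2, ("im", "jahre"): -1.5,
               ("im", "gar"): -6.0, ("im", "jähr"): -8.0,
               ("im", "ihr"): -5.5}

def rerank(prev_word, candidates, lm_weight=1.0):
    """Pick the candidate maximizing normalizer score + weighted LM score."""
    def total(cand):
        word, norm_score = cand
        lm = bigram_logp.get((prev_word, word), -10.0)  # unseen-bigram penalty
        return norm_score + lm_weight * lm
    return max(candidates, key=total)

print(rerank("im", candidates))  # ('jahr', -0.4)
```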
Improving Lemmatization of Non-Standard Languages with Joint Learning
TLDR
This paper approaches lemmatization as a string-transduction task with an encoder-decoder architecture that is enriched with sentence information using a hierarchical sentence encoder, and shows significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimize a bidirectional language-model loss.

References

Showing 1–10 of 31 references
Normalizing historical orthography for OCR historical documents using LSTM
TLDR
This paper proposes a new technique to model the target modern language by means of a recurrent neural network with a long short-term memory architecture, and shows that the proposed LSTM model outperforms previous approaches at normalizing historical wordforms to modern wordforms.
Natural Language Processing (Almost) from Scratch
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
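Applied to character-level normalization data, the source-reversal trick from that paper is a one-line preprocessing step; a hypothetical example:

```python
def make_training_pair(historical: str, modern: str):
    """Build a character-level seq2seq example, reversing the source as in
    Sutskever et al. to create short-range source-target dependencies."""
    src = list(historical)[::-1]             # reversed source characters
    tgt = ["<s>"] + list(modern) + ["</s>"]  # target with boundary symbols
    return src, tgt

print(make_training_pair("vnnd", "und"))
# (['d', 'n', 'n', 'v'], ['<s>', 'u', 'n', 'd', '</s>'])
```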
Modernizing historical Slovene words with character-based SMT
We propose a language-independent word normalization method exemplified on modernizing historical Slovene words. Our method relies on character-based statistical machine translation and uses only…
(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool
TLDR
This paper compares several approaches to normalization with a focus on methods based on string distance measures and evaluates them on two different types of historical texts, showing that a combination of normalization methods produces the best results.
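A toy version of such distance-based normalization ranks modern lexicon entries by string similarity to the historical form. The sketch below uses Python's standard-library difflib as a stand-in for the distance measures compared in the paper; the lexicon and examples are invented:

```python
import difflib

# Toy modern lexicon; a real system would use a large wordlist,
# possibly with frequencies, as tools like Norma do.
modern_lexicon = ["und", "jahr", "frau", "über", "nicht"]

def normalize_by_distance(historical: str, n: int = 3, cutoff: float = 0.4):
    """Return modern candidates ranked by string similarity."""
    return difflib.get_close_matches(historical, modern_lexicon,
                                     n=n, cutoff=cutoff)

print(normalize_by_distance("vnnd"))  # ['und']
print(normalize_by_distance("jhar"))  # ['jahr']
```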
An SMT Approach to Automatic Annotation of Historical Text
TLDR
This paper proposes an approach to tagging and parsing of historical text, using character-based SMT methods to translate the historical spelling to a modern spelling before applying the NLP tools, and shows that this approach to spelling normalisation is successful even with small amounts of training data and is generalisable to several languages.
VARD2: a tool for dealing with spelling variation in historical corpora
TLDR
This paper presents the VARD tool, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants, with particular focus on Early Modern English corpora.
Improving sentence compression by learning to predict gaze
We show how eye-tracking corpora can be used to improve sentence compression models, presenting a novel multi-task learning algorithm based on multi-layer LSTMs. We obtain performance competitive with or better than state-of-the-art approaches.
Using Comparable Collections of Historical Texts for Building a Diachronic Dictionary for Spelling Normalization
In this paper, we argue that comparable collections of historical written resources can help overcome typical challenges posed by heritage texts, enhancing spelling normalization, POS-tagging and…
Multi-task Sequence to Sequence Learning
TLDR
The results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks, and reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context.
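The regime described here, a shared encoder with task-specific output layers and per-task loss weights, can be sketched as follows; the tasks, dimensions, and weights are illustrative, not those of the paper:

```python
import torch
import torch.nn as nn

# Shared encoder; one output head per task.
shared_encoder = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
heads = {"translation": nn.Linear(64, 100),  # main-task vocabulary
         "parsing":     nn.Linear(64, 20)}   # auxiliary label set
params = (list(shared_encoder.parameters())
          + [p for head in heads.values() for p in head.parameters()])
optim = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
task_weight = {"translation": 1.0, "parsing": 0.3}  # down-weight aux task

# Alternate batches between tasks; encoder gradients accumulate from
# both tasks, which is what makes the learning multi-task.
for task in ["translation", "parsing"] * 2:
    x = torch.randn(4, 7, 32)                               # toy input batch
    y = torch.randint(0, heads[task].out_features, (4, 7))  # toy labels
    h, _ = shared_encoder(x)                                # shared features
    logits = heads[task](h)                                 # task-specific head
    loss = task_weight[task] * loss_fn(
        logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optim.zero_grad()
    loss.backward()
    optim.step()
```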