Corpus ID: 85531558

Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

Tatyana Ruzsics and Tanja Samardžić
We define multilevel text normalization as sequence-to-sequence processing that transforms naturally noisy text into a sequence of normalized units of meaning (morphemes) in three steps: 1) writing normalization, 2) lemmatization, 3) canonical segmentation. These steps are traditionally considered separate NLP tasks, with diverse solutions, evaluation schemes and data sources. We exploit the fact that all these tasks involve sub-word sequence-to-sequence transformation to propose a systematic… 

Learning attention for historical text normalization by learning to pronounce

Interestingly, it is observed that, as previously conjectured, multi-task learning can learn to focus attention during decoding, in ways remarkably similar to recently proposed attention mechanisms, which is an important step toward understanding how MTL works.

Encoder-Decoder Methods for Text Normalization

This work modifies the decoding stage of a plain ED model to include target-side language models operating at different levels of granularity (characters and words), and shows that this approach improves over the CSMT state of the art.

An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model

This work demonstrates that shallow fusion with a neural LM over wordpieces yields a 9.1% relative word error rate reduction over the authors' competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring on Google Voice Search.
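Shallow fusion is commonly described as adding a weighted language-model log-probability to the sequence-to-sequence score at each decoding step. The sketch below illustrates that general idea only; the function names, the default `lm_weight`, and the floor score for tokens unknown to the LM are assumptions for illustration and are not taken from the paper.

```python
import math


def shallow_fusion_step(s2s_logprobs, lm_logprobs, lm_weight=0.3, floor=-1e9):
    """Fuse per-token log-probabilities for one decoding step.

    s2s_logprobs / lm_logprobs: dicts mapping candidate token -> log-probability.
    lm_weight: interpolation weight for the LM score (hypothetical default).
    floor: score used when the LM has no entry for a token (assumption).
    """
    return {
        token: s2s_lp + lm_weight * lm_logprobs.get(token, floor)
        for token, s2s_lp in s2s_logprobs.items()
    }


def pick_best(fused_scores):
    """Return the candidate token with the highest fused score."""
    return max(fused_scores, key=fused_scores.get)


# Toy scores: the seq2seq model prefers "a", the LM strongly prefers "b".
s2s = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.9)}
best_without_lm = pick_best(shallow_fusion_step(s2s, lm, lm_weight=0.0))
best_with_lm = pick_best(shallow_fusion_step(s2s, lm, lm_weight=1.0))
```

With the LM weight set to zero the choice follows the seq2seq model alone; raising the weight lets the LM overturn it in this toy case.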

Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks

The proposed method is found to outperform Lemming and Morfette on every language except Bengali, and it requires no additional expensive morphological attributes for joint learning.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
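The source-reversal trick mentioned in this summary amounts to a preprocessing step on the training pairs. A minimal sketch, assuming whitespace tokenization and hypothetical helper names (neither comes from the paper):

```python
def reverse_source_tokens(tokens):
    """Return the source tokens in reverse order; the target is left unchanged."""
    return list(reversed(tokens))


def make_training_pair(source_sentence, target_sentence):
    """Build one (reversed source, target) pair from whitespace-tokenized text."""
    source_tokens = source_sentence.split()
    target_tokens = target_sentence.split()
    return reverse_source_tokens(source_tokens), target_tokens


pair = make_training_pair("a b c", "x y z")
```

After reversal, the first source word sits next to the first target word in the concatenated input, which is the kind of short-term dependency the summary refers to.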

Neural Sequence-to-sequence Learning of Internal Word Structure

This paper presents a neural encoder-decoder model that combines character-level sequence-to-sequence transformation with a language model over canonical segments for learning canonical morphological segmentation and shows that including corpus counts is beneficial to both approaches.

Context Sensitive Neural Lemmatization with Lematus

Lematus, a lemmatizer based on a standard encoder-decoder architecture, which incorporates character-level sentence context, is introduced, and it is shown that including context significantly improves results against a context-free version of the model.

Towards Better Decoding and Language Model Integration in Sequence to Sequence Models

An attention-based seq2seq speech recognition system that directly transcribes recordings into characters is analysed, observing two shortcomings: overconfidence in its predictions and a tendency to produce incomplete transcriptions when language models are used.

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

This work explores the suitability of a deep neural network architecture for historical documents processing, particularly a deep bi-LSTM network applied on a character level, and shows that multi-task learning with additional normalization data can improve the model’s performance further.

Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting

This paper presents a Levenshtein-based approach to normalisation of historical text to a modern spelling, and shows that this method is successful both in terms of normalisation accuracy, and by the performance of a standard modern tagger applied to the historical text.
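As a rough illustration of Levenshtein-based normalisation, the sketch below computes an edit distance with per-pair substitution weights and picks the closest entry from a modern word list. It is a plain weighted variant, not the context-sensitive weighting or compound splitting described in the paper; the function names, the weight table, and the toy lexicon are all hypothetical.

```python
def weighted_levenshtein(source, target, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Edit distance where substituting a for b costs sub_cost[(a, b)] (default 1.0)."""
    sub_cost = sub_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,          # delete a
                dist[i][j - 1] + ins_cost,          # insert b
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[-1][-1]


def normalise(word, lexicon, sub_cost=None):
    """Return the lexicon entry with the smallest weighted distance to word."""
    return min(lexicon, key=lambda entry: weighted_levenshtein(word, entry, sub_cost))


# Hypothetical weight: historical "v" for modern "u" is a cheap substitution.
weights = {("v", "u"): 0.1}
modern_form = normalise("vnto", ["into", "unto"], weights)
```

Lowering the cost of known historical spelling alternations is what lets the intended modern form win over other candidates at the same unweighted distance.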