The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction

  title={The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction},
  author={Felix Stahlberg},
With the advent of deep learning, research in many areas of machine learning is converging towards the same set of methods and models. For example, long short-term memory networks (Hochreiter and Schmidhuber, 1997) are not only popular for various tasks in natural language processing (NLP) such as speech recognition, machine translation, handwriting recognition, syntactic parsing, etc., but they are also applicable to seemingly unrelated fields such as bioinformatics (Min et al., 2016). Recent… 
Searching for Search Errors in Neural Morphological Inflection
Observations suggest that the poor calibration of many neural models may stem from characteristics of a specific subset of tasks rather than general ill-suitedness of such models for language generation.
Spoken Language 'Grammatical Error Correction'
Experimental results show that DD effectively regulates GEC input; endto-end training works well when fine-tuned on limited labelled in-domain data; and improving DD by incorporating acoustic information helps improve spoken GEC.


Deconvolution-Based Global Decoding for Neural Machine Translation
A new NMT model is proposed that decodes the sequence with the guidance of its structural prediction of the context of the target sequence so that the translation can be freed from the bind of sequential order.
Pointing the Unknown Words
A novel way to deal with the rare and unseen words for the neural network models using attention is proposed using attention, which uses two softmax layers in order to predict the next word in conditional language models.
Global-Context Neural Machine Translation through Target-Side Attentive Residual Connections
Att attentive residual connections from the previously translated words to the output of the decoder enable the learning of longer-range dependencies between words, which outperforms strong neural MT baselines on three language pairs, as well as a neural language modeling baseline.
The Importance of Being Recurrent for Modeling Hierarchical Structure
This work compares the two architectures—recurrent versus non-recurrent—with respect to their ability to model hierarchical structure and finds that recurrency is indeed important for this purpose.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Lattice-to-sequence attentional Neural Machine Translation models
Lattice-to-sequence attentional NMT models are proposed, which generalize the standard Recurrent Neural Network encoders to lattice topology and take as input a word lattice which compactly encodes many tokenization alternatives, and learn to generate the hidden state for the current step from multiple inputs and hidden states in previous steps.
Neural Language Correction with Character-Based Attention
A neural network-based approach to language correction that achieves a state-of-the-art $F_{0.5}$-score on the CoNLL 2014 Shared Task and demonstrates that training the network on additional data with synthesized errors can improve performance.
Self-Attentive Residual Decoder for Neural Machine Translation
A target-side-attentive residual recurrent network for decoding, where attention over previous words contributes directly to the prediction of the next word and is able to emphasize any of the previously translated words, hence it gains access to a wider context.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Improving Neural Machine Translation Models with Monolingual Data
This work pairs monolingual training data with an automatic back-translation, and can treat it as additional parallel training data, and obtains substantial improvements on the WMT 15 task English German, and for the low-resourced IWSLT 14 task Turkish->English.