Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

@inproceedings{Khandelwal2018SharpNF,
  title={Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context},
  author={Urvashi Khandelwal and He He and Peng Qi and Dan Jurafsky},
  booktitle={ACL},
  year={2018}
}
We know very little about how neural language models (LMs) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the…
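
The ablation setup in the abstract can be sketched directly: evaluate next-token perplexity while the model sees only a truncated or perturbed window of prior context, then compare against the unperturbed window. The snippet below is a minimal illustration, not the authors' released code; it assumes a trained PyTorch LM `model` that maps token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab), and the function name and perturbation labels are invented for this example.

```python
import torch
import torch.nn.functional as F

def eval_perplexity(model, tokens, context_len, perturb=None):
    """Average perplexity of the next token when the model is shown only the
    most recent `context_len` tokens, optionally perturbed (sketch only)."""
    model.eval()
    total_nll, count = 0.0, 0
    with torch.no_grad():
        for t in range(context_len, len(tokens)):
            ctx = tokens[t - context_len:t].clone()
            if perturb == "shuffle":            # destroy word order in the window
                ctx = ctx[torch.randperm(len(ctx))]
            elif perturb == "drop_distant":     # keep only the nearby half of the window
                ctx = ctx[len(ctx) // 2:]
            logits = model(ctx.unsqueeze(0))    # (1, len(ctx), vocab)
            target = tokens[t].view(1)          # token the model must predict
            total_nll += F.cross_entropy(logits[:, -1, :], target).item()
            count += 1
    return float(torch.exp(torch.tensor(total_nll / count)))
```

Sweeping `context_len` and comparing `perturb=None` against the shuffled and truncated settings yields the kind of perplexity-increase curves the paper uses to contrast how nearby and distant context are used.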
How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text
TLDR
This work empirically shows that the context-update vectors of the LSTM are approximately quantized to binary or ternary values, which helps the language model count the depth of nesting accurately, and shows that natural clusters of the functional words and the parts of speech that trigger phrases are represented in a small but principal subspace of the LSTM's context-update vector.
Dissecting Contextual Word Embeddings: Architecture and Representation
TLDR
There is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks, suggesting that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.
Linguistic Knowledge and Transferability of Contextual Representations
TLDR
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
Transformer-XL: Language Modeling
2018
We propose a novel neural architecture, Transformer-XL, for modeling longer-term dependency. To address the limitation of fixed-length contexts, we introduce a notion of recurrence by reusing the
LSTM Language Models
Language models must capture statistical dependencies between words at timescales ranging from very short to very long. Earlier work has demonstrated that dependencies in natural language tend to
Prescient Language Models: Multitask Learning for Long-term Planning in LSTM's
Many applications of language models, such as those in natural language generation, are notoriously susceptible to undesirable and difficult-to-control local behavior such as repetition, truncation,
Context Analysis for Pre-trained Masked Language Models
TLDR
A detailed analysis of contextual impact in Transformer- and BiLSTM-based masked language models suggests significant differences in contextual impact between the two model architectures.
Measuring context dependency in birdsong using artificial neural networks
TLDR
This work newly estimated the context dependency in birdsongs in a more scalable way using a modern neural-network-based language model whose accessible context length is sufficiently long and studied the relation between the assumed/auto-detected vocabulary size of birdsong and the context dependency.
Simple Local Attentions Remain Competitive for Long-Context Tasks
TLDR
Analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results — using disjoint local attentions, this work is able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer with half of its pretraining compute.
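
As a rough sketch of the disjoint local attention described above (not the paper's implementation): the sequence is split into fixed, non-overlapping blocks and each token attends only within its own block, so no window overlap is involved. Single head, no causal mask; shapes and the function name are illustrative.

```python
import torch

def disjoint_block_attention(q, k, v, block_size=64):
    """Each token attends only to tokens in its own fixed-size block."""
    b, n, d = q.shape
    assert n % block_size == 0, "sequence length must be a multiple of block_size"
    q = q.view(b, n // block_size, block_size, d)
    k = k.view(b, n // block_size, block_size, d)
    v = v.view(b, n // block_size, block_size, d)
    scores = q @ k.transpose(-1, -2) / d ** 0.5   # (b, blocks, block, block)
    out = torch.softmax(scores, dim=-1) @ v       # attention stays inside each block
    return out.view(b, n, d)
```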

References

Contextual LSTM (CLSTM) models for Large scale NLP tasks
TLDR
Results from experiments indicate that using both words and topics as features improves performance of the CLSTM models over baseline LSTM models for these tasks, demonstrating the significant benefit of using context appropriately in natural language (NL) tasks.
Larger-Context Language Modelling with Recurrent Neural Network
TLDR
It is discovered that content words, including nouns, adjectives and verbs, benefit most from an increasing number of context sentences, which suggests that larger-context language model improves the unconditional language model by capturing the theme of a document better and more easily.
Regularizing and Optimizing LSTM Language Models
TLDR
This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
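
As an illustration of the weight-dropped idea in the entry above, the cell below applies DropConnect to the hidden-to-hidden weight matrix on every training-time forward pass, leaving the input-to-hidden weights untouched. This is a hand-rolled single cell for clarity, not the AWD-LSTM code; initialization and hyper-parameter names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTMCell(nn.Module):
    """LSTM cell with DropConnect on the recurrent (hidden-to-hidden) weights."""

    def __init__(self, input_size, hidden_size, weight_drop=0.5):
        super().__init__()
        self.weight_drop = weight_drop
        self.w_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state):
        h, c = state
        # DropConnect: a fresh mask zeroes entries of W_hh at training time only.
        w_hh = F.dropout(self.w_hh, p=self.weight_drop, training=self.training)
        gates = x @ self.w_ih.t() + h @ w_hh.t() + self.bias
        i, f, g, o = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```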
Visualizing and Understanding Neural Models in NLP
TLDR
Four strategies for visualizing compositionality in neural models for NLP, inspired by similar work in computer vision, including LSTM-style gates that measure information flow and gradient back-propagation, are described.
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
TLDR
This work proposes a framework that facilitates better understanding of the encoded representations of sentence vectors and demonstrates the potential contribution of the approach by analyzing different sentence representation mechanisms.
Pointer Sentinel Mixture Models
TLDR
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM, and the freely available WikiText corpus is introduced.
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
TLDR
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.
Unbounded cache model for online language modeling with open vocabulary
TLDR
This paper uses a large scale non-parametric memory component that stores all the hidden activations seen in the past and leverages recent advances in approximate nearest neighbor search and quantization algorithms to store millions of representations while searching them efficiently.
Language Modeling with Gated Convolutional Networks
TLDR
A finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed and is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.
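
A minimal sketch of the gated-convolution approach in the entry above (not Dauphin et al.'s code): one block applies a causally padded 1-D convolution that produces two sets of channels, and one set gates the other (a gated linear unit). Because there is no recurrence, all positions are computed in parallel. Layer sizes are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """One gated convolutional LM block: causal conv followed by GLU gating."""

    def __init__(self, channels, kernel_size=4):
        super().__init__()
        self.pad = kernel_size - 1                    # left-pad so no future tokens leak in
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x):                             # x: (batch, channels, seq_len)
        y = self.conv(F.pad(x, (self.pad, 0)))        # pad only on the left (past)
        a, b = y.chunk(2, dim=1)
        return a * torch.sigmoid(b)                   # gated linear unit
```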
N-gram Language Modeling using Recurrent Neural Network Estimation
TLDR
While LSTM smoothing for short n-gram contexts does not provide significant advantages over classic n-gram models, it becomes effective with long contexts, and depending on the task and amount of data it can match fully recurrent LSTM models at about n=13.