Stolen Probability: A Structural Weakness of Neural Language Models

By David Demeter, Gregory J. Kimmel, and Doug Downey
Neural Network Language Models (NNLMs) generate probability distributions by applying a softmax function to a distance metric formed by taking the dot product of a prediction vector with all word vectors in a high-dimensional embedding space. The dot-product distance metric forms part of the inductive bias of NNLMs. Although NNLMs optimize well with this inductive bias, we show that this results in a sub-optimal ordering of the embedding space that structurally impoverishes some words at the …
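The structural weakness can be illustrated with a minimal NumPy sketch (toy 2-D embeddings, purely illustrative): a word whose vector lies inside the convex hull of the other embeddings can never receive the highest dot-product logit, no matter where the prediction vector points, so its probability is "stolen" by the hull vertices.

```python
import numpy as np

# Hypothetical 2-D embedding space: three "vertex" words and one word
# placed inside their convex hull. Under a dot-product softmax, the
# interior word can never be the argmax for any prediction vector h,
# because max_i <h, w_i> is always attained at a hull vertex.
W = np.array([
    [1.0, 0.0],    # word 0 (hull vertex)
    [0.0, 1.0],    # word 1 (hull vertex)
    [-1.0, -1.0],  # word 2 (hull vertex)
    [0.0, 0.0],    # word 3 (interior: convex combination of the others)
])

rng = np.random.default_rng(0)
argmaxes = set()
for _ in range(10_000):
    h = rng.normal(size=2)                 # random prediction vector
    logits = W @ h                         # dot-product "distance metric"
    probs = np.exp(logits - logits.max())  # softmax (unnormalized is enough
    probs /= probs.sum()                   # for the argmax, shown for clarity)
    argmaxes.add(int(np.argmax(probs)))

print(argmaxes)  # the interior word (index 3) never appears
```

The sketch samples many prediction vectors and records which word wins; the interior word never does, independent of how often it occurs in training data.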
Shape of Elephant: Study of Macro Properties of Word Embeddings Spaces
This paper demonstrates that a typical word-embedding cloud is shaped like a high-dimensional simplex with interpretable vertices, and proposes a simple yet effective method for enumerating these vertices.
Embedding Words in Non-Vector Space with Unsupervised Graph Learning
GraphGlove is introduced: unsupervised graph-based word representations learned end-to-end as a differentiable weighted graph. These graph-based representations substantially outperform vector-based methods on word similarity and analogy tasks.
On Long-Tailed Phenomena in Neural Machine Translation
This work quantitatively characterizes long-tailed phenomena at two levels of abstraction, namely token classification and sequence generation, and proposes a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
It is shown that for the target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains.
Query-Key Normalization for Transformers
QKNorm is proposed: a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity.
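A minimal sketch of the idea (assumed form, with an illustrative fixed scale in place of the learned one): L2-normalize each query and key row so their dot products become cosine similarities in [-1, 1], then multiply by a scale g before the softmax, keeping the logits bounded.

```python
import numpy as np

def qk_norm_attention(Q, K, V, g=8.0):
    """Query-key normalization sketch: unit-normalize queries and keys,
    so logits are bounded by |g| and the softmax cannot saturate
    arbitrarily. In the paper g is a learned parameter; a fixed value
    is assumed here for illustration."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    logits = g * (Qn @ Kn.T)                         # entries in [-g, g]
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))  # 5 positions, head dim 4
out = qk_norm_attention(Q, K, V)
```

Because the cosine similarities are bounded, no single key can dominate the softmax purely by having a large norm.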
Real-time Classification, Geolocation and Interactive Visualization of COVID-19 Information Shared on Social Media to Better Understand Global Developments
A prototype dashboard for the real-time classification, geolocation and interactive visualization of COVID-19 tweets is described, along with a novel L2 classification layer that outperforms linear layers on a dataset of respiratory virus tweets.


A Neural Probabilistic Language Model
This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
Pointer Sentinel Mixture Models
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM; the freely available WikiText corpus is also introduced.
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.
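The bottleneck is a rank argument, which a toy NumPy sketch can make concrete: an N-contexts-by-V-words log-probability matrix produced by a single softmax over d-dimensional states has rank at most d + 1, while a mixture of softmaxes (the paper's proposed fix, sketched here with random mixture weights rather than learned ones) escapes that ceiling.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, d = 50, 40, 5  # contexts, vocabulary size, state dim (toy sizes)

def log_softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

H = rng.normal(size=(N, d))   # context states
W = rng.normal(size=(V, d))   # output word embeddings

# Single softmax: log-probs = H W^T minus a per-row log-partition constant,
# so the N x V matrix has rank at most d + 1 no matter what the true
# next-word distributions look like.
A_single = log_softmax(H @ W.T)
print(np.linalg.matrix_rank(A_single))  # <= d + 1

# Mixture of K softmaxes: the log of a convex combination of softmaxes is
# no longer log-linear in the state, so the rank ceiling disappears.
K = 3
priors = rng.dirichlet(np.ones(K), size=N)        # per-context mixture weights
Hk = [rng.normal(size=(N, d)) for _ in range(K)]  # per-component states
P_mix = sum(priors[:, [k]] * np.exp(log_softmax(Hk[k] @ W.T))
            for k in range(K))
A_mix = np.log(P_mix)
print(np.linalg.matrix_rank(A_mix))  # typically far exceeds d + 1
```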
Regularizing and Optimizing LSTM Language Models
This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
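The DropConnect part can be sketched in a few lines (shown on a plain tanh RNN cell rather than a full LSTM, for brevity): the dropout mask is applied to the recurrent weight matrix itself rather than to the activations, so the same connections stay dropped across every use of the matrix within a pass.

```python
import numpy as np

def weight_dropped_step(h, x, W_hh, W_xh, p=0.5, rng=None):
    """One recurrent step with DropConnect on the hidden-to-hidden
    weights: zero each entry of W_hh with probability p and rescale the
    survivors (inverted dropout), then apply the usual cell update.
    A simplified stand-in for the paper's LSTM, not its exact cell."""
    rng = rng or np.random.default_rng()
    mask = rng.random(W_hh.shape) >= p   # keep each weight with prob 1 - p
    W_drop = W_hh * mask / (1.0 - p)     # rescale so expectation is unchanged
    return np.tanh(h @ W_drop + x @ W_xh)

rng = np.random.default_rng(0)
d = 8
h, x = np.zeros(d), rng.normal(size=d)
W_hh, W_xh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h_next = weight_dropped_step(h, x, W_hh, W_xh, rng=rng)
```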
Recurrent neural network based language model
Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
The strange geometry of skip-gram with negative sampling
It is found that vector positions are not simply determined by semantic similarity, but rather occupy a narrow cone, diametrically opposed to the context vectors, and that this geometric concentration depends on the ratio of positive to negative examples.
Efficient softmax approximation for GPUs
This work proposes an approximate strategy to efficiently train neural network based language models over very large vocabularies by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computational complexity.
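Why clustering by frequency pays off can be shown with a back-of-the-envelope sketch (illustrative sizes and a two-cluster split, not the paper's configuration): a small "head" of frequent words is scored every step, while the large tail is only scored for the small fraction of tokens that fall in it.

```python
import numpy as np

V = 100_000
freqs = 1.0 / np.arange(1, V + 1)  # Zipf-like word frequencies
freqs /= freqs.sum()

head = 2_000                       # most frequent words kept in the head
p_tail = freqs[head:].sum()        # probability mass of tail tokens

# Flat softmax scores all V words for every token; the two-tier split
# scores the head plus one tail-cluster logit, and the tail words only
# with probability p_tail.
full_cost = V
adaptive_cost = (head + 1) + p_tail * (V - head)  # expected logits per token
print(f"flat: {full_cost}, adaptive (expected): {adaptive_cost:.0f}")
```

Under this Zipfian toy distribution only a third or so of tokens land in the tail, so the expected per-token cost drops to a fraction of the flat softmax's.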
Factors Influencing the Surprising Instability of Word Embeddings
It is shown that even relatively high frequency words (100-200 occurrences) are often unstable, and empirical evidence is provided for how various factors contribute to the stability of word embeddings, and the effects of stability on downstream tasks are analyzed.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family
Several loss functions from the spherical family are explored as possible alternatives to the traditional log-softmax loss; some surprisingly outperform it in experiments on MNIST and CIFAR-10, suggesting they may be relevant in a broad range of applications.