Stolen Probability: A Structural Weakness of Neural Language Models

David Demeter, Gregory J. Kimmel, and Doug Downey

Neural Network Language Models (NNLMs) generate probability distributions by applying a softmax function to a distance metric formed by taking the dot product of a prediction vector with all word vectors in a high-dimensional embedding space. The dot-product distance metric forms part of the inductive bias of NNLMs. Although NNLMs optimize well with this inductive bias, we show that this results in a sub-optimal ordering of the embedding space that structurally impoverishes some words at the… 
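
To make the setup in the abstract concrete, the following is a minimal sketch of a softmax over dot products in plain Python. The toy 2-D embeddings, the function names `word_probabilities` and `interior_ever_wins`, and the choice of interior point are all assumptions made for illustration; none of them come from the paper. The sketch relies on one general fact: if a word vector is a convex combination of other word vectors, its dot product with any prediction vector is a weighted average of theirs, so it cannot exceed the largest of them.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_probabilities(h, embeddings):
    """Softmax over the dot products of prediction vector h with each word vector."""
    return softmax([dot(h, e) for e in embeddings])


# Hypothetical toy vocabulary: words 0-2 are the vertices of a triangle,
# word 3 lies strictly inside that triangle (inside the convex hull).
embeddings = [
    (1.0, 0.0),
    (-0.5, 0.866),
    (-0.5, -0.866),
    (0.1, 0.1),
]
INTERIOR = 3


def interior_ever_wins(trials=1000, seed=0):
    """Return True if the interior word is the most probable word for any sampled h."""
    rng = random.Random(seed)
    for _ in range(trials):
        h = (rng.uniform(-10, 10), rng.uniform(-10, 10))
        probs = word_probabilities(h, embeddings)
        if probs.index(max(probs)) == INTERIOR:
            return True
    return False
```

Calling `interior_ever_wins()` samples random prediction vectors and checks whether the interior word is ever the most probable one. In this toy configuration it never is, whatever direction or length the prediction vector takes.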

Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions

This work discovers that a single hidden state cannot produce all probability distributions regardless of the LM size or training data size, and proposes multi-facet softmax (MFS) to address the limitations of MoS.

Shape of Elephant: Study of Macro Properties of Word Embeddings Spaces

This paper demonstrates that a typical word embeddings cloud is shaped as a high-dimensional simplex with interpretable vertices and proposes a simple yet effective method for enumeration of these vertices.

Embedding Words in Non-Vector Space with Unsupervised Graph Learning

GraphGlove is introduced: unsupervised graph word representations that are learned end-to-end in a differentiable weighted graph. Graph-based representations are shown to substantially outperform vector-based methods on word similarity and analogy tasks.

Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice

Algorithms to detect unargmaxable tokens in public models are developed and it is found that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality.

On Long-Tailed Phenomena in Neural Machine Translation

This work quantitatively characterizes long-tailed phenomena at two levels of abstraction, namely token classification and sequence generation, and proposes a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

It is shown that for the target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains.

Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings

This study analyzes the training dynamics of token embeddings, focusing on rare token embeddings, and proposes a novel method called adaptive gradient gating (AGG), which addresses the degeneration problem by gating a specific part of the gradient for rare token embeddings.

Query-Key Normalization for Transformers

This work proposes QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity.

Real-time Classification, Geolocation and Interactive Visualization of COVID-19 Information Shared on Social Media to Better Understand Global Developments

This work describes a prototype dashboard for the real-time classification, geolocation and interactive visualization of COVID-19 tweets that addresses issues, and a novel L2 classification layer that outperforms linear layers on a dataset of respiratory virus tweets.

Removing Partial Mismatches in Unsupervised Image Captioning



A Neural Probabilistic Language Model

This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

It is shown that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck, and a simple and effective method is proposed to address this issue.

Regularizing and Optimizing LSTM Language Models

This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.

Recurrent neural network based language model

Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.

The strange geometry of skip-gram with negative sampling

It is found that vector positions are not simply determined by semantic similarity, but rather occupy a narrow cone, diametrically opposed to the context vectors, and that this geometric concentration depends on the ratio of positive to negative examples.

Efficient softmax approximation for GPUs

This work proposes an approximate strategy to efficiently train neural network based language models over very large vocabularies by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computational complexity.

Factors Influencing the Surprising Instability of Word Embeddings

It is shown that even relatively high frequency words (100-200 occurrences) are often unstable, and empirical evidence is provided for how various factors contribute to the stability of word embeddings, and the effects of stability on downstream tasks are analyzed.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family

Several loss functions from the spherical family are explored as possible alternatives to the traditional log-softmax loss. They surprisingly outperform it in experiments on MNIST and CIFAR-10, suggesting that they might be relevant in a broad range of applications.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.