Corpus ID: 1190093

Neural Word Embedding as Implicit Matrix Factorization

Omer Levy and Yoav Goldberg
We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a… 
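The factorization view described above can be sketched end to end: build a word-context co-occurrence matrix, compute PMI, shift each cell by log k (k being the number of negative samples), clip at zero, and take a truncated SVD of the resulting matrix to obtain embeddings. The following NumPy sketch illustrates the idea on a toy corpus; the corpus, window size, k, and embedding dimension are all illustrative assumptions, not the paper's experimental setup.

```python
# Sketch of the shifted positive-PMI factorization view of SGNS.
# Toy corpus and hyperparameters; purely for illustration.
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
window, k, dim = 2, 5, 2  # context window, negative samples, embedding size

# 1. Word-context co-occurrence counts within the window.
counts = Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[(idx[w], idx[corpus[j]])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    M[w, c] = n

# 2. Shifted positive PMI: max(PMI(w, c) - log k, 0).
total = M.sum()
pw = M.sum(axis=1, keepdims=True) / total
pc = M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((M / total) / (pw * pc))  # -inf where the count is zero
sppmi = np.maximum(pmi - np.log(k), 0)

# 3. Truncated SVD of the shifted PPMI matrix yields the word embeddings.
U, S, Vt = np.linalg.svd(sppmi)
word_vectors = U[:, :dim] * np.sqrt(S[:dim])
print(word_vectors.shape)  # (vocabulary size, dim)
```

Clipping negative cells at zero keeps the matrix sparse and avoids the unbounded -inf entries that exact PMI would require for unobserved pairs.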

Tables from this paper

Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective

It is pointed out that SGNS is essentially a representation learning method, which learns to represent the co-occurrence vector for a word, and that extended supervised word embedding can be established based on the proposed representation learning view.

Fast PMI-Based Word Embedding with Efficient Use of Unobserved Patterns

A new word embedding algorithm that works on a smoothed Positive Pointwise Mutual Information (PPMI) matrix obtained from word-word co-occurrence counts is proposed, along with a kernel similarity measure for the latent space that can effectively calculate similarities in high dimensions.

Word Embeddings via Tensor Factorization

It is shown that embeddings based on tensor factorization can be used to discern the various meanings of polysemous words without being explicitly trained to do so, and an intuition is given for why this works where existing methods do not.

WordRank: Learning Word Embeddings via Robust Ranking

This paper argues that word embedding can be naturally viewed as a ranking problem due to the ranking nature of the evaluation metrics, and proposes a novel framework WordRank that efficiently estimates word representations via robust ranking, in which the attention mechanism and robustness to noise are readily achieved via the DCG-like ranking losses.

Exponential Family Word Embeddings: An Iterative Approach for Learning Word Vectors

This work proposes an iterative algorithm for computing word vectors based on modeling word co-occurrence matrices with Generalized Low Rank Models, and demonstrates that multiple iterations of the algorithm improve results over the GloVe method on the Google word analogy similarity task.

PMIVec: a word embedding model guided by point-wise mutual information criterion

This paper proposes a novel word embedding method based on point-wise mutual information criterion (PMIVec), which explicitly learns the context vector as the final word representation for each word, while discarding the word vector.

A Generative Word Embedding Model and its Low Rank Positive Semidefinite Solution

This work proposes a generative word embedding model, which is easy to interpret and can serve as a basis for more sophisticated latent factor models; it is competitive with word2vec and better than other MF-based methods.

Continuous Word Embedding Fusion via Spectral Decomposition

This paper builds on the established view of word embeddings as matrix factorizations to present a spectral algorithm for this task, and demonstrates that the method is able to embed the new words efficiently into the original embedding space.

Spectral Word Embedding with Negative Sampling

This work examines the notion of ``negative examples'', the unobserved or insignificant word-context co-occurrences, in spectral methods, and proposes a new formulation of the word embedding problem with an intuitive objective function that justifies the use of negative examples.

Word Embedding With Zipf’s Context

A simpler but efficient word embedding method based on co-occurrence matrix factorization according to Zipf's word frequency law is presented; it shows comparable performance despite being much simpler than the neural language models.

References

Linguistic Regularities in Sparse and Explicit Word Representations

It is demonstrated that analogy recovery is not restricted to neural word embeddings, and that a similar amount of relational similarities can be recovered from traditional distributional word representations.

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.

Linguistic Regularities in Continuous Space Word Representations

The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset.
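The relation-specific vector offset described above is usually evaluated by nearest-neighbor search: for an analogy "a is to b as c is to ?", find the word whose vector is closest to b - a + c. A minimal sketch, using tiny hand-made vectors purely for illustration (real embeddings are learned, and the vector values here are assumptions):

```python
# Vector-offset analogy: king - man + woman ≈ queen.
# Toy 3-dimensional vectors chosen by hand for illustration only.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def analogy(a, b, c):
    """Return the word whose vector is most cosine-similar to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the three query words, as is standard in analogy evaluation.
    return max((w for w in vectors if w not in {a, b, c}),
               key=lambda w: cos(vectors[w], target))

print(analogy("man", "king", "woman"))  # "queen" with these toy vectors
```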

Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD

This article investigates the use of three further factors—namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)—that have been used to provide improved performance elsewhere and introduces an additional semantic task and explores the advantages of using a much larger corpus.

word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method

This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.
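The negative-sampling objective that the note above unpacks scores one observed (word, context) pair against k randomly drawn negative contexts: maximize log σ(w·c) plus the sum of log σ(-w·c_neg) over the negatives. A sketch of the per-pair loss with random toy vectors (dimensions, k, and the random vectors are assumptions for illustration):

```python
# Per-pair SGNS negative-sampling loss:
#   -[ log σ(w·c) + Σ_i log σ(-w·c_neg_i) ]
# Toy random vectors stand in for learned embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim, k = 50, 5
w = rng.normal(size=dim)               # word vector
c = rng.normal(size=dim)               # observed context vector
negatives = rng.normal(size=(k, dim))  # k sampled negative contexts

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

loss = -(np.log(sigmoid(w @ c)) + np.log(sigmoid(-negatives @ w)).sum())
print(loss)  # scalar loss to minimize by gradient descent on w, c, negatives
```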

Word Representations: A Simple and General Method for Semi-Supervised Learning

This work evaluates Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking, and finds that each of the three word representations improves the accuracy of these baselines.

A Neural Probabilistic Language Model

This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

Dependency-Based Word Embeddings

The skip-gram model with negative sampling introduced by Mikolov et al. is generalized to include arbitrary contexts, and experiments with dependency-based contexts are performed, showing that they produce markedly different embeddings.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Distributional Memory: A General Framework for Corpus-Based Semantics

The Distributional Memory approach is shown to be tenable despite the constraints imposed by its multi-purpose nature, and performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against several state-of-the-art methods.