• Corpus ID: 12890187

word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method

Yoav Goldberg and Omer Levy
The word2vec software of Tomas Mikolov and colleagues (this https URL) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind…

word2vec Parameter Learning Explained

Detailed derivations and explanations of the parameter update equations of the word2vec models, including the original continuous bag-of-words (CBOW) and skip-gram (SG) models, as well as advanced optimization techniques, including hierarchical softmax and negative sampling, are provided.
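As a rough illustration of the kind of parameter update that summary refers to, the sketch below performs one stochastic gradient step on the skip-gram negative-sampling loss for a single (word, context) pair. It is a minimal sketch under stated assumptions, not code from the paper: the function name `sgns_step`, its arguments, and the learning rate are all hypothetical, and plain Python lists stand in for embedding rows.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def sgns_step(v_word, u_pos, u_negs, lr=0.1):
    """One SGD step for a single (word, context) pair (hypothetical helper).

    v_word: input vector of the centre word, updated in place
    u_pos:  output vector of the observed context word, updated in place
    u_negs: output vectors of the sampled negative words, updated in place
    Returns the loss evaluated before the update:
        -log sigmoid(u_pos . v) - sum_k log sigmoid(-u_neg_k . v)
    """
    # Label 1 for the observed pair, label 0 for each sampled negative.
    targets = [(u_pos, 1.0)] + [(u, 0.0) for u in u_negs]
    grad_v = [0.0] * len(v_word)
    loss = 0.0
    for u, label in targets:
        score = sigmoid(dot(u, v_word))
        loss -= math.log(score if label == 1.0 else 1.0 - score)
        g = score - label  # derivative of the loss w.r.t. the dot product
        for i in range(len(v_word)):
            grad_v[i] += g * u[i]        # accumulate using the old u
            u[i] -= lr * g * v_word[i]   # update the output vector
    for i in range(len(v_word)):
        v_word[i] -= lr * grad_v[i]      # update the input vector last
    return loss
```

Calling `sgns_step` repeatedly on the same small vectors should drive the loss down, because the observed pair's dot product is pushed up and the negatives' dot products are pushed down.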

Bayesian Paragraph Vectors

This work develops an unsupervised generative model whose maximum likelihood solution corresponds to traditional paragraph vectors and finds that the entropy of paragraph vectors decreases with the length of documents, and that information about posterior uncertainty improves performance in supervised learning tasks such as sentiment analysis and paraphrase detection.

dish2vec: A Comparison of Word Embedding Methods in an Unsupervised Setting

This paper elaborates on three popular word embedding methods, GloVe and two versions of word2vec (continuous skip-gram and continuous bag-of-words), and addresses the instability of the methods with respect to the hyperparameters.

Modeling Order in Neural Word Embeddings at Scale

A new neural language model incorporating both word order and character order in its embedding is proposed, which produces several vector spaces with meaningful substructure, as evidenced by its performance on a recent word-analogy task.

Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings

A rigorous mathematical analysis is performed using the model priors to arrive at a simple closed form expression that approximately relates co-occurrence statistics and word embeddings, and leads to good solutions to analogy tasks.

The Spectral Underpinning of word2vec

A rigorous analysis of the highly nonlinear functional of word2vec suggests that word2vec may be primarily driven by an underlying spectral method, which may open the door to obtaining provable guarantees for word2vec.

Intrinsic Evaluation of Lithuanian Word Embeddings Using WordNet

This work has determined the superiority of the continuous bag-of-words over the skip-gram architecture, while the training algorithm and dimensionality showed no significant impact on the results.

On the Effective Use of Pretraining for Natural Language Inference

It is shown that pretrained embeddings outperform both random and retrofitted ones in a large NLI corpus, and two principled approaches to initializing the rest of the model parameters, Gaussian and orthogonal, are explored.

Revisiting Skip-Gram Negative Sampling Model with Rectification

This work revisits skip-gram negative sampling and rectifies the SGNS model with quadratic regularization, and shows that this simple modification suffices to structure the solution in the desired manner.

The Corpus Replication Task

  • T. Eichinger
  • Computer Science
  • 2017 International Conference on Computational Science and Computational Intelligence (CSCI)
  • 2017
This work revisits the well-known word embedding algorithm word2vec and proposes a bottom-up approach to solving the Corpus Replication Task, providing partial answers to two questions: which kinds of relations are representable in continuous space, and how relations are built.



Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
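To make the negative-sampling idea mentioned in that summary more concrete, the sketch below shows one way the negative words might be drawn. It assumes the noise distribution reported in that paper, unigram counts raised to the 3/4 power, which the summary above does not itself state. The helper names `noise_distribution` and `sample_negatives` are hypothetical, and excluding the observed context word from the draw is an illustrative choice.

```python
import random
from collections import Counter


def noise_distribution(tokens, power=0.75):
    """Map each word to its probability under the smoothed unigram distribution."""
    counts = Counter(tokens)
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def sample_negatives(dist, k, exclude, rng):
    """Draw k negative words (with replacement), never returning `exclude`."""
    words = [word for word in dist if word != exclude]
    probs = [dist[word] for word in words]
    return rng.choices(words, weights=probs, k=k)
```

Raising counts to a power below 1 flattens the distribution, so rare words are drawn as negatives somewhat more often than their raw frequency would suggest.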