Corpus ID: 14992849

Learning word embeddings efficiently with noise-contrastive estimation

@inproceedings{Mnih2013LearningWE,
  title={Learning word embeddings efficiently with noise-contrastive estimation},
  author={Andriy Mnih and Koray Kavukcuoglu},
  booktitle={NIPS},
  year={2013}
}
Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. We propose a simple and scalable approach to learning word embeddings based on noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing…
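
As a rough illustration of the idea (a sketch, not the authors' model or code), the NumPy snippet below computes a noise-contrastive estimation loss for one (word, context) pair using k noise words drawn from a unigram noise distribution. The names target_vecs, context_vec, noise_probs, and nce_loss are hypothetical placeholders introduced only for this example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(target_vecs, context_vec, word_idx, noise_probs, k, rng):
    """NCE loss for one (word, context) pair with k noise samples."""
    # Sample k noise words from the unigram noise distribution.
    noise_idx = rng.choice(len(noise_probs), size=k, p=noise_probs)

    # Score of a candidate word against the context, shifted by log(k * q(w)),
    # so the binary classifier compares the model score against the noise prior.
    def delta(idx):
        return target_vecs[idx] @ context_vec - np.log(k * noise_probs[idx])

    loss = -np.log(sigmoid(delta(word_idx)))             # true word, label 1
    loss -= np.sum(np.log(sigmoid(-delta(noise_idx))))   # noise words, label 0
    return loss

Minimizing this loss over many (word, context) pairs, instead of computing a full softmax over the vocabulary, is what makes training fast: the gradient touches only k + 1 output vectors per example.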

Citations

Improving Word Embeddings Using Kernel PCA

This paper uses word embeddings generated by both word2vec and fastText models and enriches them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices, in order to reduce training time and enhance performance.

Exponential Word Embeddings: Models and Approximate Learning

This thesis shows that a representation based on multiple vectors per word easily overcomes the limitations of a single vector per word, with different vectors representing the different meanings of a word, which is especially beneficial when training data is noisy and scarce.

A Framework for Learning Knowledge-Powered Word Embedding

A novel neural network architecture is introduced that leverages both contextual information and morphological word similarity to learn word embeddings, and it is demonstrated that the proposed method can greatly enhance the effectiveness of word embeddings.

Word and Document Embeddings based on Neural Network Approaches

This thesis presents some guidelines for how to generate a good word embedding and introduces the joint training of Chinese characters and words for Chinese word representation.

Learning Compact Neural Word Embeddings by Parameter Space Sharing

This paper investigates the trade-off between quality and model size of embedding vectors for several linguistic benchmark datasets, and shows that the proposed learning framework can significantly reduce the model size while maintaining the task performance of conventional methods.

Word Embeddings for Natural Language Processing

A novel model that jointly learns word embeddings and their summation is introduced, and it achieves good performance in sentiment classification of short and long text documents with a convolutional neural network.

PAWE: Polysemy Aware Word Embeddings

This work develops a new word embedding model that can accurately represent polysemous words by automatically learning multiple representations for each word, whilst remaining computationally efficient.

Learning Effective Word Embedding using Morphological Word Similarity

A novel neural network architecture is introduced that leverages both contextual information and morphological word similarity to learn word embeddings and is able to refine the pre-defined morphological knowledge and obtain more accurate word similarity.

Sparse Word Embeddings Using ℓ1 Regularized Online Learning

Inspired by the success of sparse models in enhancing interpretability, a sparsity constraint is introduced into Word2Vec; the proposed model obtains state-of-the-art results with better interpretability using fewer than 10% non-zero elements.
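
As an illustrative aside (not the paper's actual algorithm), ℓ1-regularized online learning is often realized with a proximal, i.e. soft-thresholding, update applied after each gradient step; this is what drives many embedding coordinates exactly to zero. The sketch below assumes a single embedding vector vec, its gradient grad, a learning rate lr, and a regularization weight lam, all hypothetical names for this example.

import numpy as np

def l1_sgd_step(vec, grad, lr, lam):
    """One proximal SGD step: gradient step, then soft-thresholding."""
    vec = vec - lr * grad  # ordinary SGD step on the task loss
    # Proximal operator of lr * lam * ||.||_1: shrink toward zero and clip at zero.
    return np.sign(vec) * np.maximum(np.abs(vec) - lr * lam, 0.0)

Coordinates whose magnitude falls below lr * lam after the gradient step are set exactly to zero, which is how sparsity of the kind reported (fewer than 10% non-zero elements) can arise.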

Mixed Membership Word Embeddings: Corpus-Specific Embeddings Without Big Data

This work proposes a probabilistic model-based word embedding method which can recover high-dimensional embeddings without big data, due to a sensible parameter sharing scheme, and leverages the notion of mixed membership modeling.
...

References


Word Representations: A Simple and General Method for Semi-Supervised Learning

This work evaluates Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking, and finds that each of the three word representations improves the accuracy of near state-of-the-art supervised baselines.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

A fast and simple algorithm for training neural probabilistic language models

This work proposes a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions, and demonstrates the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary.

Linguistic Regularities in Continuous Space Word Representations

The vector-space word representations implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, with each relationship characterized by a relation-specific vector offset.
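
The relation-specific offset is usually illustrated with analogies such as vec("king") - vec("man") + vec("woman") ≈ vec("queen"). The NumPy sketch below runs that test, assuming a hypothetical embedding matrix emb (one row per word) and a vocab dict mapping words to row indices; both names are placeholders, not part of the cited work.

import numpy as np

def analogy(emb, vocab, a, b, c, topn=1):
    """Return the word(s) closest by cosine similarity to vec(b) - vec(a) + vec(c)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length rows
    query = unit[vocab[b]] - unit[vocab[a]] + unit[vocab[c]]
    query = query / np.linalg.norm(query)
    scores = unit @ query                                    # cosine similarities
    inv = {i: w for w, i in vocab.items()}
    ranked = [inv[i] for i in np.argsort(-scores) if inv[i] not in {a, b, c}]
    return ranked[:topn]

# With good embeddings, analogy(emb, vocab, "man", "king", "woman") is expected
# to return ["queen"].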

A Neural Probabilistic Language Model

This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

A unified architecture for natural language processing: deep neural networks with multitask learning

We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, and semantically similar words.

A Scalable Hierarchical Distributed Language Model

A fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data are introduced and it is shown that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models.

Hierarchical Probabilistic Neural Network Language Model

A hierarchical decomposition of the conditional probabilities, constrained by prior knowledge extracted from the WordNet semantic hierarchy, is introduced, yielding a speed-up of about 200× during both training and recognition.
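
Schematically (and hedging on the exact parameterization used in the paper), the hierarchical decomposition replaces the flat softmax over the vocabulary with a product of binary decisions along the path from the root of the word tree to the leaf for w:

P(w \mid h) \;=\; \prod_{j=1}^{L(w)-1} \sigma\!\left( d_j(w)\, \mathbf{v}_{n_j(w)}^{\top} \mathbf{h} \right), \qquad d_j(w) \in \{-1, +1\},

where h is the context representation, n_j(w) is the j-th node on the path to w, v_{n_j(w)} is that node's parameter vector, and d_j(w) encodes whether the path branches left or right. The product has on the order of log2 |V| factors instead of the |V| terms of a flat softmax, which is the source of the reported speed-up.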

The Microsoft Research Sentence Completion Challenge

This work presents the MSR Sentence Completion Challenge Data, which consists of 1,040 sentences, each of which has four impostor sentences in which a single (fixed) word in the original sentence has been replaced by an impostor word with similar occurrence statistics.

Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise and it is shown that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models.
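
Written out (with noise-to-data ratio k and notation that is only schematic here), the noise-contrastive objective is the expected log-likelihood of this data-versus-noise logistic regression:

J(\theta) \;=\; \mathbb{E}_{x \sim p_d}\!\left[ \ln h(x;\theta) \right] \;+\; k\, \mathbb{E}_{y \sim p_n}\!\left[ \ln\!\big(1 - h(y;\theta)\big) \right], \qquad h(u;\theta) \;=\; \frac{p_m(u;\theta)}{p_m(u;\theta) + k\, p_n(u)},

where p_m is the (possibly unnormalized) model density, p_n is the noise density, and k is the number of noise samples per data sample. Maximizing J(\theta) fits the unnormalized model directly, normalization constant included, which is what makes the method attractive for large-vocabulary language models.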