A Neural Probabilistic Language Model

@article{bengio2003nplm,
  title={A Neural Probabilistic Language Model},
  author={Yoshua Bengio and R{\'e}jean Ducharme and Pascal Vincent and Christian Janvin},
  journal={Journal of Machine Learning Research},
  year={2003},
}
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the… 

Neural Probabilistic Language Models

This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences, and incorporates this new language model into a state-of-the-art speech recognizer of conversational speech.
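As a concrete illustration of the architecture this summary describes, here is a minimal sketch of one forward pass of such a neural language model in pure Python. All dimensions and weights are made up for illustration; the names loosely follow the paper's notation (a shared embedding matrix C, hidden weights H, output weights U), but this is a toy, not the paper's implementation.

```python
import math
import random

def softmax(z):
    m = max(z)                                   # shift for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nplm_forward(context_ids, C, H, U):
    """One forward pass of a tiny neural probabilistic LM: look up and
    concatenate the context-word embeddings, apply a tanh hidden layer,
    and score every vocabulary word with a softmax output."""
    x = [v for i in context_ids for v in C[i]]   # concatenated embeddings
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in H]
    y = [sum(w * hi for w, hi in zip(row, h)) for row in U]
    return softmax(y)

random.seed(0)
V, d, n_ctx, n_h = 8, 4, 2, 5                    # toy vocab/embed/context/hidden sizes
C = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(V)]
H = [[random.gauss(0, 0.1) for _ in range(d * n_ctx)] for _ in range(n_h)]
U = [[random.gauss(0, 0.1) for _ in range(n_h)] for _ in range(V)]

p = nplm_forward([3, 5], C, H, U)                # P(next word | two context words)
```

Because the context enters through shared embeddings, semantically close contexts produce close hidden states — the mechanism behind the "exponential number of semantically neighboring sentences" claim.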

Connectionist Language Model for Polish

A connectionist language model is described, which may be used as an alternative to the well-known n-gram models; perplexity is used as a measure of language model quality.
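Perplexity, the quality measure used here and throughout this literature, is simply the exponential of the average negative log-likelihood the model assigns to a held-out word sequence — a quick sketch:

```python
import math

def perplexity(probs):
    """Perplexity over a test sequence, given the probability the model
    assigned to each word: exp of the average negative log-likelihood.
    Lower is better."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)

# A model that spreads probability uniformly over 10 words is as
# "perplexed" as a 10-sided die:
pp = perplexity([0.1] * 5)   # each test word got probability 0.1
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.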

Hierarchical Distributed Representations for Statistical Language Modeling

This paper shows how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model, and demonstrates consistent improvement over class-based bigram models.

Distributed Representation of Words in Vector Space for Kannada Language

A distributed representation for Kannada words is proposed using an optimal neural network model and combining various known techniques to improve the vector space representation.

Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition

A novel neural network language model structure, the succeeding-word RNNLM (su-RNNLM), is proposed; it is more efficient to train than bi-directional models and can be applied to lattice rescoring.

Hierarchical Probabilistic Neural Network Language Model

A hierarchical decomposition of the conditional probabilities, constrained by prior knowledge extracted from the WordNet semantic hierarchy, is introduced; it yields a speed-up of about 200× during both training and recognition.
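The source of that speed-up is worth spelling out: the word probability becomes a product of binary decisions along the word's path in a tree over the vocabulary, so only O(log V) nodes are evaluated instead of all V softmax outputs. A minimal sketch (the actual paper derives its tree from WordNet; here the tree node is just an illustrative weight vector):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hierarchical_prob(h, path):
    """P(word | context) as a product of binary left/right decisions along
    the word's path in a tree over the vocabulary. `path` is a list of
    (node_weights, go_left) pairs; only the O(log V) nodes on the path are
    touched, instead of the V outputs of a flat softmax."""
    p = 1.0
    for weights, go_left in path:
        z = sum(w * hi for w, hi in zip(weights, h))   # score at this node
        p *= sigmoid(z) if go_left else 1.0 - sigmoid(z)
    return p

h = [0.1, -0.2]        # hidden representation of the context (illustrative)
node = [0.3, 0.4]      # weights of a single tree node (illustrative)
p_left = hierarchical_prob(h, [(node, True)])
p_right = hierarchical_prob(h, [(node, False)])
```

Because each node's left/right probabilities sum to one, the leaf probabilities automatically form a normalized distribution — no full-vocabulary normalization is ever computed.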

Bayesian Recurrent Neural Network for Language Modeling

A Bayesian approach is proposed to regularize the RNN-LM and apply it to continuous speech recognition, compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior.

Building neural network language model with POS-based negative sampling and stochastic conjugate gradient descent

This paper proposes a stochastic conjugate gradient descent algorithm to accelerate the convergence of the neural network's parameters, and a POS-based negative sampling algorithm that optimizes the sampling process and improves the quality of the final language model.
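For readers unfamiliar with the objective being optimized here: negative sampling replaces the full-vocabulary normalization with a binary classification between the observed word and k sampled "negative" words. The sketch below shows the generic objective only — it does not reproduce this paper's POS-based sampling distribution, and the scorers are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_sampling_loss(score, pos_id, neg_ids):
    """Generic negative-sampling objective: instead of normalizing over
    the whole vocabulary, train a binary classifier to push the observed
    word's score up and the scores of k sampled negatives down."""
    loss = -math.log(sigmoid(score(pos_id)))      # observed word: label 1
    for j in neg_ids:
        loss += -math.log(sigmoid(-score(j)))     # sampled negatives: label 0
    return loss

good = lambda i: 2.0 if i == 0 else -2.0   # scores the true word highest
bad = lambda i: -2.0 if i == 0 else 2.0    # scores the negatives highest
```

A model that ranks the observed word above the sampled negatives incurs a lower loss, which is exactly the gradient signal that stands in for the expensive softmax.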

Word Embeddings for Natural Language Processing

A novel model that jointly learns word embeddings and their summation is introduced; good performance is achieved in sentiment classification of short and long text documents with a convolutional neural network.

A Survey on Language Modeling using Neural Networks

The survey summarizes and groups literature that has addressed the curse of dimensionality in language modeling, and examines promising recent research on neural network techniques applied to language modeling, in order to overcome this curse and achieve better generalization over word sequences.

Connectionist language modeling for large vocabulary continuous speech recognition

  • Holger Schwenk, J. Gauvain
  • Computer Science
    2002 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 2002
The connectionist language model is being evaluated on the DARPA HUB5 conversational telephone speech recognition task and preliminary results show consistent improvements in both perplexity and word error rate.

Natural Language Processing with Modular Neural Networks and Distributed Lexicon

An approach to connectionist natural language processing is proposed, which is based on hierarchically organized modular Parallel Distributed Processing (PDP) networks and a central lexicon of

Class-Based n-gram Models of Natural Language

This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based or semantically based groupings, depending on the nature of the underlying statistics.
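The core factorization of such class-based models is compact enough to sketch: P(w | w_prev) is approximated by a class transition probability times a word emission probability, so statistics are shared across all words in a class. The toy dictionaries below are illustrative, not from the paper:

```python
def class_bigram_prob(w_prev, w, word_class, p_class, p_word_given_class):
    """Class-based bigram: approximate P(w | w_prev) by
    P(class(w) | class(w_prev)) * P(w | class(w)), sharing bigram
    statistics across every word in a class."""
    c_prev, c = word_class[w_prev], word_class[w]
    return p_class[(c_prev, c)] * p_word_given_class[(w, c)]

# Toy model: determiners are followed by nouns 90% of the time.
word_class = {"the": "DET", "dog": "N", "cat": "N"}
p_class = {("DET", "N"): 0.9, ("DET", "DET"): 0.1}
p_word_given_class = {("dog", "N"): 0.5, ("cat", "N"): 0.5, ("the", "DET"): 1.0}

p = class_bigram_prob("the", "dog", word_class, p_class, p_word_given_class)
```

Even if "the dog" never occurred in training, the model assigns it sensible probability as long as some determiner–noun pairs were seen — the same generalization pressure that distributed representations later push much further.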

A bit of progress in language modeling

Applying a combination of all techniques to a Katz-smoothed trigram model with no count cutoffs achieves perplexity reductions between 38 and 50% (about 1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%.

Estimation of probabilities from sparse data for the language model component of a speech recognizer

  • Slava M. Katz
  • Computer Science
    IEEE Trans. Acoust. Speech Signal Process.
  • 1987
The model offers, via a nonlinear recursive procedure, a computation and space efficient solution to the problem of estimating probabilities from sparse data, and compares favorably to other proposed methods.
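The back-off structure of this estimator can be sketched briefly. Katz's actual scheme uses Good-Turing discounting; the version below substitutes a simple absolute discount to keep the sketch short, so the numbers (counts, discount, unigram model) are illustrative only — the shape of the recursion is the point:

```python
def katz_bigram(w1, w2, bigram_counts, unigram_p, discount=0.5):
    """Katz-style back-off, simplified with an absolute discount: use a
    discounted bigram estimate when the bigram was seen, otherwise fall
    back to the unigram model, scaled so P(. | w1) still sums to one."""
    seen = {w: c for (a, w), c in bigram_counts.items() if a == w1}
    total = sum(seen.values())
    if w2 in seen:
        return (seen[w2] - discount) / total          # discounted ML estimate
    freed = discount * len(seen) / total              # mass freed by discounting
    unseen = sum(p for w, p in unigram_p.items() if w not in seen)
    return freed * unigram_p[w2] / unseen             # redistribute over unseen

counts = {("a", "b"): 2, ("a", "c"): 2}               # toy training counts
uni = {"b": 0.4, "c": 0.4, "d": 0.1, "e": 0.1}        # toy unigram model
```

Seen bigrams keep most of their count mass, while the discounted remainder is handed to the lower-order model for the words never observed after w1 — the "computation and space efficient" answer to sparse data.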

Distributional Clustering of English Words

Deterministic annealing is used to find lowest-distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data.

Finding Structure in Time

A proposal along these lines, first described by Jordan (1986), involves the use of recurrent links to provide networks with a dynamic memory; a method for representing lexical categories and the type/token distinction is also developed.
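The recurrent link that gives these networks their dynamic memory amounts to one extra term in the hidden-state update. A minimal single-step sketch of a simple recurrent ("Elman") network, with illustrative identity-like weights chosen only to make the behavior easy to follow:

```python
import math

def elman_step(x, h_prev, W, U):
    """One step of a simple recurrent network: the recurrent weights U
    feed the previous hidden state back in, giving the network a dynamic
    memory of the sequence processed so far."""
    return [math.tanh(sum(w * xi for w, xi in zip(W[j], x)) +
                      sum(u * hi for u, hi in zip(U[j], h_prev)))
            for j in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]      # input weights (identity, for clarity)
U = [[0.5, 0.0], [0.0, 0.5]]      # recurrent weights
h1 = elman_step([1.0, 0.0], [0.0, 0.0], W, U)
h2 = elman_step([0.0, 0.0], h1, W, U)   # no new input, yet state persists
```

After the second step the input is silent, but h2 still carries a trace of the first input through U — exactly the "finding structure in time" the title refers to.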

Taking on the curse of dimensionality in joint distributions using neural networks

This paper proposes a new architecture for modeling high-dimensional data that requires resources that grow at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions.
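The chain-rule factorization that makes this tractable is short enough to state in code: the joint probability of x_1..x_n equals the product of the conditionals P(x_i | x_1..x_{i-1}), so a single network modeling the conditionals represents the entire joint distribution. The memoryless coin below is just the simplest possible conditional model:

```python
def joint_prob(sequence, conditional):
    """Chain-rule factorization: the joint probability of x_1..x_n is the
    product of P(x_i | x_1..x_{i-1}); modeling the conditionals with one
    network of quadratic size avoids the exponential joint table."""
    p = 1.0
    for i, x in enumerate(sequence):
        p *= conditional(x, sequence[:i])   # condition only on the prefix
    return p

# A memoryless fair coin: every sequence of 3 flips has probability 1/8.
p = joint_prob(["H", "T", "H"], lambda x, prefix: 0.5)
```

A full joint table over n binary variables needs 2^n entries; the conditional network replaces it with parameters that grow only polynomially in n, which is the paper's answer to the curse of dimensionality.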

Can artificial neural networks learn language models?

This paper investigates an alternative way to build language models, namely using artificial neural networks to learn the model, and shows that a neural network can learn a language model whose performance is even better than that of standard statistical methods.