A Neural Probabilistic Language Model

@article{Bengio2000ANP,
  title={A Neural Probabilistic Language Model},
  author={Yoshua Bengio and R{\'e}jean Ducharme and Pascal Vincent and Christian Janvin},
  journal={J. Mach. Learn. Res.},
  volume={3},
  pages={1137--1155},
  year={2003}
}
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. …
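As a quick orientation before the citation list, here is a minimal NumPy sketch of the kind of feed-forward neural language model the abstract describes: each word maps to a learned feature vector, and the next-word distribution is computed from the concatenated context vectors. The sizes and variable names are illustrative, not the paper's exact architecture or hyperparameters.

```python
# Minimal sketch of a feed-forward neural probabilistic language model:
# embedding lookup -> tanh hidden layer -> softmax over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, m, h, n = 1000, 30, 50, 3        # vocab size, embedding dim, hidden units, context length

C = rng.normal(scale=0.1, size=(V, m))          # shared word feature vectors
H = rng.normal(scale=0.1, size=(n * m, h))      # context -> hidden
U = rng.normal(scale=0.1, size=(h, V))          # hidden  -> output scores
b = np.zeros(V)

def next_word_probs(context_ids):
    """P(w_t | w_{t-n}, ..., w_{t-1}) for every word in the vocabulary."""
    x = C[context_ids].reshape(-1)              # concatenate the n context embeddings
    a = np.tanh(x @ H)                          # hidden representation of the context
    scores = a @ U + b
    scores -= scores.max()                      # numerical stability for the softmax
    p = np.exp(scores)
    return p / p.sum()

probs = next_word_probs([12, 7, 403])           # three arbitrary context word ids
print(probs.shape, probs.sum())                 # (1000,) ~1.0
```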
Connectionist Language Model for Polish
TLDR
A connectionist language model is described, which may be used as an alternative to the well-known n-gram models, and perplexity is used as a measure of language model quality.
Neural Probabilistic Language Models
TLDR
This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences, and incorporates this new language model into a state-of-the-art speech recognizer of conversational speech.
Hierarchical Distributed Representations for Statistical Language Modeling
TLDR
This paper shows how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model, and demonstrates consistent improvement over class-based bigram models.
Distributed Representation of Words in Vector Space for Kannada Language
TLDR
A distributed representation for Kannada words is proposed using an optimal neural network model and combining various known techniques to improve the vector space representation.
Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition
TLDR
A novel neural network language model structure, the succeeding-word RNNLM (su-RNNLM), is proposed, which is more efficient to train than bi-directional models and can be applied to lattice rescoring.
Hierarchical Probabilistic Neural Network Language Model
TLDR
A hierarchical decomposition of the conditional probabilities, constrained by prior knowledge extracted from the WordNet semantic hierarchy, is introduced, yielding a speed-up of about 200 during both training and recognition.
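To make the speed-up concrete, here is an illustrative sketch of the hierarchical idea: arrange the vocabulary as the leaves of a binary tree and express P(word | context) as a product of binary decisions along the path to that leaf, so evaluating one word costs O(log |V|) sigmoids instead of an O(|V|) softmax. The balanced tree below is a placeholder for the WordNet-derived hierarchy used in the paper.

```python
# Hierarchical decomposition of P(word | context) over a balanced binary tree.
import numpy as np

rng = np.random.default_rng(1)
V, h = 1024, 50                                      # vocab size (power of 2 here), context dim
depth = int(np.log2(V))
node_vecs = rng.normal(scale=0.1, size=(V - 1, h))   # one parameter vector per internal node

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_prob(word_id, context_vec):
    """P(word | context) as a product of left/right decisions along the tree path."""
    prob, node = 1.0, 0                              # start at the root (internal node 0)
    for bit in format(word_id, f"0{depth}b"):        # the word id's bits encode its leaf path
        p_right = sigmoid(node_vecs[node] @ context_vec)
        prob *= p_right if bit == "1" else (1.0 - p_right)
        node = 2 * node + (2 if bit == "1" else 1)   # descend in the implicit binary tree
    return prob

ctx = rng.normal(size=h)
print(word_prob(37, ctx))                            # ~10 sigmoids, not a 1024-way softmax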
Bayesian Recurrent Neural Network for Language Modeling
  • Jen-Tzung Chien, Y. Ku
  • Computer Science
    IEEE Transactions on Neural Networks and Learning Systems
  • 2016
TLDR
A Bayesian approach to regularize the RNN-LM and apply it for continuous speech recognition by compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior.
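One common reading of the Gaussian-prior regularization mentioned above is that MAP estimation under a zero-mean Gaussian prior on the weights reduces to an L2 penalty on the training objective. The snippet below shows only that correspondence, not the paper's full Bayesian treatment of the RNN-LM; the function and parameter names are illustrative.

```python
# MAP objective under a Gaussian prior N(0, sigma^2 I) on the parameters:
# negative log posterior = data NLL + L2 penalty scaled by 1/(2*sigma^2).
import numpy as np

def map_objective(nll, params, sigma2=1.0):
    """Negative log posterior = data NLL + Gaussian-prior penalty on the weights."""
    l2 = sum(float(np.sum(p ** 2)) for p in params)
    return nll + l2 / (2.0 * sigma2)

theta = [np.ones((3, 3)), np.ones(3)]                 # toy parameter tensors
print(map_objective(nll=42.0, params=theta, sigma2=0.5))   # 42 + 12/1.0 = 54.0
```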
Building neural network language model with POS-based negative sampling and stochastic conjugate gradient descent
TLDR
This paper proposes a gradient descent algorithm based on stochastic conjugate gradient to accelerate convergence of the neural network's parameters, together with a negative sampling algorithm that optimizes the sampling process and improves the quality of the final language model.
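For reference, here is a minimal sketch of a negative-sampling objective: instead of normalizing over the whole vocabulary, the observed next word is scored against k sampled "negative" words with a logistic loss. The paper's contribution is drawing those negatives according to part-of-speech information; the uniform sampler below is only a placeholder for that distribution.

```python
# Negative-sampling loss: -log sigma(s_target) - sum_j log sigma(-s_neg_j).
import numpy as np

rng = np.random.default_rng(2)

def neg_sampling_loss(context_vec, out_embed, target_id, k=5):
    """Logistic loss of the target word against k negatives (drawn uniformly here)."""
    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)
    loss = -log_sigmoid(out_embed[target_id] @ context_vec)
    negatives = rng.integers(0, out_embed.shape[0], size=k)   # placeholder sampler
    loss -= np.sum(log_sigmoid(-(out_embed[negatives] @ context_vec)))
    return float(loss)

V, h = 1000, 50
out_E = rng.normal(scale=0.1, size=(V, h))
print(neg_sampling_loss(rng.normal(size=h), out_E, target_id=7))
```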
Word Embeddings for Natural Language Processing
TLDR
A novel model that jointly learns word embeddings and their summation is introduced; it achieves good performance in sentiment classification of short and long text documents with a convolutional neural network.
Knowledge integration into language models: a random forest approach
TLDR
This work formalizes the method of integrating various knowledge sources into language models with random forests and illustrates its applicability with three innovative applications: morphological LMs of Arabic, prosodic LMs for speech recognition, and the combination of syntactic and topic information in LMs.

References

Showing 1–10 of 55 references
Connectionist language modeling for large vocabulary continuous speech recognition
  • Holger Schwenk, J. Gauvain
  • Computer Science
    2002 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 2002
TLDR
The connectionist language model is evaluated on the DARPA HUB5 conversational telephone speech recognition task; preliminary results show consistent improvements in both perplexity and word error rate.
Natural Language Processing with Modular Neural Networks and Distributed Lexicon
An approach to connectionist natural language processing is proposed, which is based on hierarchically organized modular Parallel Distributed Processing (PDP) networks and a central lexicon of distributed word representations.
Class-Based n-gram Models of Natural Language
TLDR
This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based or semantically based groupings, depending on the nature of the underlying statistics.
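The class-based factorization behind such models replaces P(w_i | w_{i-1}) with P(w_i | class(w_i)) * P(class(w_i) | class(w_{i-1})). The sketch below estimates both factors with plain relative frequencies; the word-to-class map and toy counts are made up for illustration.

```python
# Class-based bigram: P(w_i | w_{i-1}) = P(w_i | c(w_i)) * P(c(w_i) | c(w_{i-1})).
from collections import Counter

word_class = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN", "runs": "VERB"}
tokens = "the dog runs the cat runs a dog runs".split()

class_bigrams = Counter(zip([word_class[w] for w in tokens[:-1]],
                            [word_class[w] for w in tokens[1:]]))
class_counts = Counter(word_class[w] for w in tokens)
word_counts = Counter(tokens)

def class_bigram_prob(prev_word, word):
    c_prev, c = word_class[prev_word], word_class[word]
    p_class = class_bigrams[(c_prev, c)] / sum(n for (a, _), n in class_bigrams.items() if a == c_prev)
    p_word = word_counts[word] / class_counts[c]
    return p_class * p_word

print(class_bigram_prob("the", "dog"))   # P(NOUN | DET) * P(dog | NOUN)
```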
An empirical study of smoothing techniques for language modeling
TLDR
This work surveys the most widely used algorithms for smoothing n-gram language models, presents an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), and introduces methodologies for analyzing smoothing-algorithm efficacy in detail.
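One of the simplest schemes compared in that survey is Jelinek-Mercer linear interpolation: mix the maximum-likelihood bigram estimate with the unigram estimate so that unseen bigrams still receive non-zero probability. The sketch below uses an arbitrary mixing weight and a toy corpus.

```python
# Jelinek-Mercer interpolation of bigram and unigram estimates.
from collections import Counter

tokens = "the dog runs the cat sleeps the dog sleeps".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))
N = len(tokens)

def p_interp(word, prev, lam=0.7):
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / N
    return lam * p_bigram + (1.0 - lam) * p_unigram

print(p_interp("dog", "the"))     # seen bigram: dominated by the bigram estimate
print(p_interp("cat", "dog"))     # unseen bigram: falls back to the unigram term
```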
A bit of progress in language modeling
TLDR
A combination of all techniques together to a Katz smoothed trigram model with no count cutoffs achieves perplexity reductions between 38 and 50% (1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%.
Estimation of probabilities from sparse data for the language model component of a speech recognizer
  • S. Katz
  • Computer Science
    IEEE Trans. Acoust. Speech Signal Process.
  • 1987
TLDR
The model offers, via a nonlinear recursive procedure, a computation and space efficient solution to the problem of estimating probabilities from sparse data, and compares favorably to other proposed methods.
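The back-off idea behind Katz smoothing is to discount the counts of seen bigrams and redistribute the freed mass to unseen bigrams via the unigram distribution. The sketch below uses a fixed absolute discount as a stand-in for the Good-Turing discounting of the paper, so it is a simplification of the actual method.

```python
# Simplified back-off bigram model with a fixed absolute discount.
from collections import Counter

tokens = "the dog runs the cat sleeps the dog sleeps".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))
N, D = len(tokens), 0.5                       # corpus size and a fixed discount

def p_backoff(word, prev):
    if bigrams[(prev, word)] > 0:             # seen bigram: discounted relative frequency
        return (bigrams[(prev, word)] - D) / unigrams[prev]
    # back-off weight alpha(prev): probability mass removed from bigrams seen after prev
    seen = [w for (p, w) in bigrams if p == prev]
    alpha = D * len(seen) / unigrams[prev]
    p_unseen_unigrams = sum(unigrams[w] for w in unigrams if w not in seen) / N
    return alpha * (unigrams[word] / N) / p_unseen_unigrams

print(p_backoff("dog", "the"))                # discounted bigram estimate
print(p_backoff("runs", "cat"))               # backed-off unigram estimate
```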
Distributional Clustering of English Words
TLDR
Deterministic annealing is used to find lowest-distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data.
Finding Structure in Time
TLDR
A proposal along these lines, first described by Jordan (1986), involves the use of recurrent links to provide networks with a dynamic memory; a method for representing lexical categories and the type/token distinction is also developed.
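The recurrent mechanism referred to here is the simple ("Elman") recurrent network, in which the hidden state is fed back as a context input at the next time step, giving the network a dynamic memory of the sequence so far. The dimensions and inputs below are arbitrary.

```python
# Simple recurrent (Elman-style) network: hidden state fed back as context.
import numpy as np

rng = np.random.default_rng(3)
d_in, d_hid = 8, 16
W_in = rng.normal(scale=0.1, size=(d_in, d_hid))
W_rec = rng.normal(scale=0.1, size=(d_hid, d_hid))    # recurrent (context) links
b = np.zeros(d_hid)

def run_elman(inputs):
    h = np.zeros(d_hid)                                # initial context
    states = []
    for x in inputs:                                   # one step per input vector
        h = np.tanh(x @ W_in + h @ W_rec + b)          # new state depends on the old one
        states.append(h)
    return np.stack(states)

seq = rng.normal(size=(5, d_in))                       # a length-5 input sequence
print(run_elman(seq).shape)                            # (5, 16)
```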
Taking on the curse of dimensionality in joint distributions using neural networks
TLDR
This paper proposes a new architecture for modeling high-dimensional data that requires resources that grow at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions.
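The factorization described above writes the joint distribution of d variables as a chain-rule product of conditionals, P(x) = prod_i P(x_i | x_1, ..., x_{i-1}), with a parameterized model producing each conditional from the preceding variables. In the sketch below each conditional is a single logistic unit, so the parameter count grows only as O(d^2); the architecture is illustrative, not the paper's exact network.

```python
# Joint distribution of d binary variables as a product of learned conditionals.
import numpy as np

rng = np.random.default_rng(4)
d = 10
W = rng.normal(scale=0.1, size=(d, d))     # W[i, :i] are the weights for conditional i
b = np.zeros(d)

def log_joint(x):
    """log P(x) as a sum of log conditionals, x a 0/1 vector of length d."""
    total = 0.0
    for i in range(d):
        z = W[i, :i] @ x[:i] + b[i]                    # depends only on earlier variables
        p_i = 1.0 / (1.0 + np.exp(-z))                 # P(x_i = 1 | x_<i)
        total += np.log(p_i if x[i] == 1 else 1.0 - p_i)
    return total

x = rng.integers(0, 2, size=d)
print(log_joint(x))
```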