• Corpus ID: 10986188

Class-Based n-gram Models of Natural Language

@article{Brown1992ClassBasedNM,
  title={Class-Based n-gram Models of Natural Language},
  author={Peter F. Brown and Vincent J. Della Pietra and Peter V. de Souza and Jennifer C. Lai and Robert L. Mercer},
  journal={Comput. Linguistics},
  year={1992},
  volume={18},
  pages={467-479}
}
We address the problem of predicting a word from previous words in a sample of text. […] We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
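
For readers skimming this page, the core idea behind the class-based n-gram models the title refers to is a factorization in which each word w is assigned to a single class c(w), and the probability of a word given its predecessor is approximated as p(w_i | w_{i-1}) ≈ p(c(w_i) | c(w_{i-1})) · p(w_i | c(w_i)), which sharply reduces the number of parameters when there are far fewer classes than vocabulary words. The Python sketch below only illustrates that factorization under simplifying assumptions (a given word-to-class mapping, unsmoothed relative-frequency estimates, illustrative names such as train_class_bigram, word2class, "<s>", and "<unk>"); it is not the paper's implementation, and the clustering step itself (merging classes based on co-occurrence statistics) is left out.

    from collections import Counter

    def train_class_bigram(sentences, word2class):
        """Estimate a class-based bigram model from tokenized sentences.

        p(w | prev_w) ~= p(c(w) | c(prev_w)) * p(w | c(w)), where c(.) is the
        externally supplied word-to-class mapping `word2class` (an assumed
        input here, e.g. the output of a separate clustering step).
        """
        class_bigrams = Counter()   # counts of (previous class, class) pairs
        class_counts = Counter()    # occurrences of each class
        word_counts = Counter()     # occurrences of each word

        for sent in sentences:
            prev_c = "<s>"
            class_counts["<s>"] += 1
            for w in sent:
                c = word2class.get(w, "<unk>")
                class_bigrams[(prev_c, c)] += 1
                class_counts[c] += 1
                word_counts[w] += 1
                prev_c = c

        def prob(w, prev_w):
            c = word2class.get(w, "<unk>")
            prev_c = "<s>" if prev_w == "<s>" else word2class.get(prev_w, "<unk>")
            # Unsmoothed relative-frequency estimates; a real model smooths both factors.
            p_class = class_bigrams[(prev_c, c)] / max(class_counts[prev_c], 1)
            p_word = word_counts[w] / max(class_counts[c], 1)
            return p_class * p_word

        return prob

With one class per word this reduces to an ordinary bigram model; with far fewer classes than words it trades some precision for a much smaller, better-estimated parameter set, which is the trade-off class-based models exploit.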

Citations

Augmenting words with linguistic information for n-gram language models
TLDR
Using part-of-speech tags and syntactic/semantic feature tags obtained with a set of NLP tools developed at Microsoft Research, a reduction in perplexity is obtained compared to the baseline phrase trigram model in a set of preliminary tests performed on part of the WSJ corpus.
Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
TLDR
A class-based language model that clusters rare words of similar morphology together improves the prediction of words after histories containing out-of-vocabulary words.
Automatic Determination of a Stochastic Bi-Gram Class Language Model
TLDR
A class-based bigram model, determined entirely automatically from written text corpora, is developed; it is well adapted to highly inflected languages such as French.
Word-phrase-entity language models: getting more mileage out of n-grams
TLDR
This paper focuses on a cold-start approach that assumes only the availability of a word-level training corpus and a number of generic class definitions; its iterative optimization algorithm considers alternative parses of the corpus in terms of tokens, re-estimates token n-gram probabilities, and updates within-class distributions.
WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery
TLDR
An estimate of mutual information is used to calculate what nouns a verb can take as its subjects and objects, based on distributions found within a large corpus of naturally occurring text.
Proposal for a mutual-information based language model
We propose a probabilistic language model that is intended to overcome some of the limitations of the well-known n-gram models, namely the strong dependence of the parameter values of the model on […]
A Factual Research of Word Class-Based Features For Natural Language Processing
  • Computer Science
  • 2017
TLDR
This paper shows that class-based features extracted from different data sources using alternative word clustering methods can each contribute to performance gains, analyzes how these features improve model accuracy, and shows their connection with shrinkage of the model size.
SEMANTIC TEXT CLUSTERS AND WORD CLASSES – THE DUALISM OF MUTUAL INFORMATION AND MAXIMUM LIKELIHOOD
TLDR
This short paper presents two approaches for adaptive unigram language models and illustrates their relation in a more general information theoretic framework.
A novel interpolated N-gram language model based on class hierarchy
TLDR
A novel interpolated language model that combines interpolation and backing-off along a hierarchical word-class structure is proposed, together with the corresponding approach to estimating the interpolation coefficients.
Modeling of term-distance and term-occurrence information for improving n-gram language model performance
TLDR
It is shown that word pairs can be modeled more effectively in terms of both distance and occurrence, and that such modeling complements the n-gram model well, since the n-gram model inherently suffers from data scarcity when learning long history contexts.

References

SHOWING 1-10 OF 24 REFERENCES
Experiments with the Tangora 20,000 word speech recognizer
  • A. Averbuch, L. Bahl, H. Wilkens
  • Computer Science
    ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1987
TLDR
The implementation, user interface, and comparative performance of the recognizer are described; the recognizer supports spelling and interactive personalization to augment its vocabularies.
A Maximum Likelihood Approach to Continuous Speech Recognition
TLDR
This paper describes a number of statistical models for use in speech recognition, with special attention to determining the parameters for such models from sparse data, and describes two decoding methods appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks.
Context based spelling correction
  • 1990
A Statistical Approach to Machine Translation
  • Computational Linguistics
  • 1990
TLDR
The application of the statistical approach to translation from French to English is described and preliminary results are given.
Information Theory and Reliable Communication
TLDR
This chapter discusses Coding for Discrete Sources, Techniques for Coding and Decoding, and Source Coding with a Fidelity Criterion.
Maximum likelihood from incomplete data via the EM algorithm (with discussion)