The CMU Statistical Language Modeling toolkit was released in 1994 in order to facilitate the construction and testing of bigram and trigram language models. It is currently in use in over 40 academic, government and industrial laboratories in over 12 countries. This paper presents a new version of the toolkit. We outline the conventional language modeling...
When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. This paper shows that the accuracy of a naive Bayes text classifier can be significantly improved by taking advantage of a hierarchy of classes. We adopt an established statistical...
In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare...
In order for speech recognizers to deal with increased task perplexity, speaker variation, and environment variation, improved speech recognition is critical. Steady progress has been made along these three dimensions at Carnegie Mellon. In this paper, we review the SPHINX-II speech recognition system and summarize our recent efforts on improved speech...
We introduce an exponential language model which models a whole sentence or utterance as a single unit. By avoiding the chain rule, the model treats each sentence as a "bag of features", where features are arbitrary computable properties of the sentence. The new model is computationally more efficient, and more naturally suited to modeling global...
An adaptive statistical language model is described, which successfully integrates long distance linguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's history, we propose and use trigger pairs as the basic...
In recent years speech recognition has become commercially viable on off-the-shelf computers, a goal that has long been sought by both the research community and by prospective users. Anyone who has used speech recognition technology understands that it has many flaws and much remains to be done. Uncertainty exists about how speech can and should be used, as...