Statistical language model based on a hierarchical approach: MCnv

Abstract

In this paper, we propose a new language model based on dependent word sequences organized in a multi-level hierarchy. We call this model MCnv, where n is the maximum number of words in a sequence and v is the maximum number of levels. The originality of this model is its ability to take into account dependent variable-length sequences for very large vocabularies. In order to discover the variable-length sequences and to build the hierarchy, we use a set of 233 syntactic classes derived from the 8 elementary French grammatical classes. The MCnv model learns hierarchical word patterns and uses them to re-evaluate and filter the n-best utterance hypotheses output by our speech recognizer MAUD. The model has been trained on a corpus of 43 million words extracted from a French newspaper and uses a vocabulary of 20,000 words. Tests were conducted on 300 sentences. The model achieves a 17% reduction in perplexity compared to an interpolated class trigram model, and rescoring the original n-best hypotheses yields a 5% improvement in accuracy.
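To make the rescoring step concrete, the sketch below shows one common way to re-rank n-best hypotheses by combining the recognizer's acoustic score with a language-model score in the log domain. The Hypothesis structure, the lm_logprob function, and the lm_weight scale factor are illustrative assumptions, not the actual MAUD or MCnv interface described in the paper.

```python
# Minimal sketch of n-best rescoring with a language model.
# Hypothesis, lm_logprob() and lm_weight are hypothetical names used
# for illustration; they are not the MAUD / MCnv implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    words: List[str]       # candidate transcription
    acoustic_score: float  # log-domain acoustic score from the recognizer


def rescore_nbest(
    hypotheses: List[Hypothesis],
    lm_logprob: Callable[[List[str]], float],  # assumed LM scoring function (log-probability)
    lm_weight: float = 10.0,                   # assumed language-model scale factor
) -> List[Hypothesis]:
    """Re-rank n-best hypotheses by a log-linear combination of scores."""
    def combined(h: Hypothesis) -> float:
        return h.acoustic_score + lm_weight * lm_logprob(h.words)
    return sorted(hypotheses, key=combined, reverse=True)
```

Under this sketch, the filtered and re-evaluated transcription is simply the first element of the re-ranked list, e.g. rescore_nbest(nbest, my_lm_logprob)[0].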

Cite this paper

@inproceedings{Zitouni2001StatisticalLM,
  title={Statistical language model based on a hierarchical approach: MCnv},
  author={Imed Zitouni and Kamel Sma{\"i}li and Jean Paul Haton},
  booktitle={INTERSPEECH},
  year={2001}
}