Algorithms for bigram and trigram word clustering

Abstract

In this paper, we describe an efficient method for obtaining word classes for class language models. The method employs an exchange algorithm using the criterion of perplexity improvement. The novel contributions of this paper are the extension of the class bigram perplexity criterion to the class trigram perplexity criterion, the description of an efficient implementation for speeding up the clustering process, the detailed computational complexity analysis of the clustering algorithm, and, finally, experimental results on large text corpora of about 1, 4, 39 and 241 million words including examples of word classes, test corpus perplexities in comparison to word language models, and speech recognition results. © 1998 Elsevier Science B.V. All rights reserved.
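As a rough illustration of the exchange idea named in the abstract, the sketch below clusters words under a class bigram model by moving one word at a time to whichever class gives the largest gain in training log-likelihood (equivalently, the largest drop in training perplexity, since perplexity is exp(-LL/N)). It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the round-robin initialization, and the stopping rule are hypothetical, and the criterion is recomputed from scratch for every tentative move rather than updated incrementally as an efficient implementation would do.

```python
import math
from collections import Counter


def class_bigram_loglik(bigrams, cls):
    """Class-dependent part of the class bigram log-likelihood.

    bigrams: Counter mapping (v, w) word pairs to counts.
    cls: dict mapping each word to a class id.
    The word-unigram term sum_w N(w) log N(w) does not depend on the
    class assignment and is omitted.
    """
    pair = Counter()  # N(g_v, g_w): class bigram counts
    pred = Counter()  # class counts in predecessor position
    succ = Counter()  # class counts in successor position
    for (v, w), n in bigrams.items():
        gv, gw = cls[v], cls[w]
        pair[(gv, gw)] += n
        pred[gv] += n
        succ[gw] += n

    def xlogx(n):
        return n * math.log(n)

    return (sum(xlogx(n) for n in pair.values())
            - sum(xlogx(n) for n in pred.values())
            - sum(xlogx(n) for n in succ.values()))


def exchange_clustering(tokens, num_classes, max_iters=20):
    """Naive exchange algorithm (illustrative only)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = [w for w, _ in Counter(tokens).most_common()]
    # Assumption: round-robin initialization by frequency rank.
    cls = {w: i % num_classes for i, w in enumerate(vocab)}
    best = class_bigram_loglik(bigrams, cls)
    for _ in range(max_iters):
        moved = False
        for w in vocab:
            home = cls[w]
            best_c = home
            # Try every other class and remember the best improvement.
            for c in range(num_classes):
                if c == home:
                    continue
                cls[w] = c
                score = class_bigram_loglik(bigrams, cls)
                if score > best + 1e-12:
                    best, best_c = score, c
            cls[w] = best_c  # keep the best move, or stay put
            moved = moved or best_c != home
        if not moved:  # no word changed class in a full pass
            break
    return cls, best
```

Calling `exchange_clustering(text.split(), num_classes=3)` on a tokenized corpus returns the word-to-class map and the final criterion value. Each tentative move here costs a full pass over the bigram table, so the sketch is only practical for toy corpora.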

Cite this paper

@inproceedings{Martin1998AlgorithmsFB, title={Algorithms for bigram and trigram word clustering}, author={Sven C. Martin and Jorg Liermann and Hermann Ney}, year={1998} }