CSE 254 (Spring 2003): “Growing N-gram Trees for Language Modeling”

Abstract

We implement variable n-grams using a word-tree data structure, where each node represents a sequence of words (a “context”) and stores how often that context appeared in a training corpus. We build the tree by growing it outward from the root. Unlike other methods, there is no pruning step. Instead, we use a simple heuristic: we maintain a priority queue of candidate leaves, sorted by how often those contexts occur in the training text. The most popular leaves are added to the tree, and this process repeats until a specified memory limit is reached. In this way, the tree grows branches for common longer fragments such as “across the street from the” while avoiding the cost of storing uncommon ones. We test our system on samples from the North American News Text corpus. Training and testing perplexities are comparable to those of a standard trigram model, and better in some cases.
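The growth heuristic can be illustrated with a short sketch. The code below is not the paper's implementation; the function names, the node-budget stand-in for a memory limit, and the maximum context length are all illustrative assumptions.

    # A minimal sketch of the growth heuristic described in the abstract.
    # Names, node_budget (a proxy for the memory limit), and max_order are
    # illustrative assumptions, not the paper's actual parameters.
    import heapq
    from collections import defaultdict

    def count_contexts(tokens, max_order=5):
        """Count every word sequence (context) of length 1..max_order."""
        counts = defaultdict(int)
        for i in range(len(tokens)):
            for n in range(1, max_order + 1):
                if i + n <= len(tokens):
                    counts[tuple(tokens[i:i + n])] += 1
        return counts

    def grow_tree(tokens, node_budget=1000, max_order=5):
        """Grow the context tree by repeatedly promoting the most frequent
        candidate leaf, until the node budget (memory limit) is reached."""
        counts = count_contexts(tokens, max_order)
        tree = {(): len(tokens)}                 # root = empty context
        # Candidate leaves: one-word contexts, prioritized by frequency.
        heap = [(-c, ctx) for ctx, c in counts.items() if len(ctx) == 1]
        heapq.heapify(heap)
        while heap and len(tree) < node_budget:
            neg_count, ctx = heapq.heappop(heap)
            tree[ctx] = -neg_count               # add the most popular leaf
            # One-word extensions of the new node become fresh candidates.
            for longer, c in counts.items():
                if len(longer) == len(ctx) + 1 and longer[:len(ctx)] == ctx:
                    heapq.heappush(heap, (-c, longer))
        return tree

    # A frequent fragment like "across the street from the" can become a
    # deep branch, while rare contexts never enter the tree.
    tokens = ("across the street from the park "
              "across the street from the store").split()
    print(sorted(grow_tree(tokens, node_budget=15), key=len))

In this sketch the memory limit is approximated by a cap on the number of tree nodes; any comparable budget (e.g. bytes of storage) would fit the same loop.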


Cite this paper

@inproceedings{Boswell2004CSE2, title={CSE 254 (Spring 2003): “Growing N-gram Trees for Language Modeling”}, author={D. R. Boswell}, year={2004} }