Another Look at the Data Sparsity Problem

Ben Allison, David Guthrie, Louise Guthrie
Performance on a statistical language processing task relies upon accurate information being found in a corpus. However, it is known (and this paper will confirm) that many perfectly valid word sequences do not appear in training corpora. The percentage of n-grams in a test document which are seen in a training corpus is defined as n-gram coverage, and work in the speech processing community [7] has shown that there is a correlation between n-gram coverage and word error rate (WER) on a speech…
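The definition of n-gram coverage above can be illustrated with a minimal sketch. It assumes whitespace-tokenised text and counts coverage over n-gram tokens (every occurrence in the test document), which may differ from the paper's own counting choices. The function names `ngrams` and `ngram_coverage` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of the token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_coverage(train_tokens: Sequence[str],
                   test_tokens: Sequence[str],
                   n: int) -> float:
    """Percentage of test n-grams that also occur in the training tokens.

    Assumption: coverage is measured per n-gram occurrence (token-level),
    not per distinct n-gram (type-level).
    """
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    if not test:
        # Test document is shorter than n: report zero coverage.
        return 0.0
    hits = sum(1 for gram in test if gram in seen)
    return 100.0 * hits / len(test)


train = "the cat sat on the mat".split()
test = "the cat sat on the rug".split()
bigram_cov = ngram_coverage(train, test, 2)
trigram_cov = ngram_coverage(train, test, 3)
```

In this toy example four of the five test bigrams and three of the four test trigrams appear in the training text, so coverage drops as n grows, which is the sparsity effect the abstract describes.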
This paper has 22 citations.


Publications referenced by this paper.

There’s No Data Like More Data (But When Will Enough Be Enough?)

  • R. Moore
  • In Proceedings of IEEE International Workshop on…
  • 2001

Up from trigrams

  • F. Jelinek
  • In Proceedings Eurospeech
  • 1991

The Anarchist’s Cookbook

  • W. Powell
  • Ozark Pr Llc
  • 1970