Scaling to Very Very Large Corpora for Natural Language Disambiguation

Michele Banko and Eric Brill
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used.
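Confusion set disambiguation, the task evaluated here, means choosing which member of a set of commonly confused words (e.g. {then, than}) belongs in a given context. The paper compares several learners; the sketch below is not the paper's method, just one illustrative baseline: a naive Bayes classifier over context words with add-one smoothing, using a hypothetical toy corpus.

```python
from collections import Counter
import math

# Illustrative confusion set; the paper uses several such sets.
CONFUSION_SET = ("then", "than")

def train(sentences, window=2):
    """Count words appearing within +/-window of each confusion set member."""
    context_counts = {w: Counter() for w in CONFUSION_SET}
    priors = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok in CONFUSION_SET:
                priors[tok] += 1
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        context_counts[tok][toks[j]] += 1
    return priors, context_counts

def disambiguate(context, priors, context_counts):
    """Pick the member maximizing smoothed log-likelihood of the context."""
    vocab = set()
    for counts in context_counts.values():
        vocab.update(counts)
    best, best_score = None, float("-inf")
    for w in CONFUSION_SET:
        total = sum(context_counts[w].values())
        score = math.log(priors[w] + 1)
        for tok in context:
            # add-one smoothing over the observed context vocabulary
            score += math.log((context_counts[w][tok] + 1) / (total + len(vocab) + 1))
        if score > best_score:
            best, best_score = w, score
    return best

# Hypothetical tiny training corpus, stand-in for the web-scale text the paper uses.
corpus = [
    "better late than never",
    "easier said than done",
    "first we eat then we sleep",
    "she smiled and then she left",
]
priors, counts = train(corpus)
print(disambiguate(["better", "never"], priors, counts))  # prints "than" on this toy corpus
```

The paper's central point is that the choice among such learners matters less than the amount of training data: performance for each method keeps improving, roughly log-linearly, as the corpus grows by orders of magnitude.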
This paper has highly influenced 26 other papers and has approximately 490 citations (Semantic Scholar estimate).




