A Lightweight and Efficient Tool for Cleaning Web Pages

Stefan Evert
Originally conceived as a “naive” baseline experiment using traditional n-gram language models as classifiers, the NCLEANER system has turned out to be a fast and lightweight tool for cleaning Web pages with state-of-the-art accuracy (based on results from the CLEANEVAL competition held in 2007). Despite its simplicity, the algorithm achieves a significant improvement over the baseline (i.e. plain, uncleaned text dumps), trading off recall for substantially higher precision. NCLEANER is…
This paper has 41 citations.





