DOM-based content extraction of HTML documents

@inproceedings{Gupta2003DOMbasedCE,
  title={DOM-based content extraction of HTML documents},
  author={Suhit Gupta and Gail E. Kaiser and David Neistadt and Peter Grimm},
  booktitle={WWW},
  year={2003}
}
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as…
Highly Influential
This paper has highly influenced 12 other papers.
Highly Cited
This paper has 324 citations.

Citations per Year (chart; years 2004 to 2018)

Semantic Scholar estimates that this publication has 324 citations based on the available data.
