Corpus ID: 11589623

A Benchmark Suite for Template Detection and Content Extraction

  • Julián Alarte, David Insa, J. Silva, S. Tamarit
  • Published 2014
  • Computer Science
  • ArXiv
  • Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both analyze the structure and content of webpages to extract some part of the document, but their objectives differ. Template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. Therefore, they are…

    Web content extraction based on maximum continuous sum of text density
    • Kai Sun, M. Li, +4 authors S. Fu
    • Computer Science
    • 2016
    An Effective Method to Extract Web Content Information


    Publications referenced by this paper.
    Using the words/leafs ratio in the DOM tree for content extraction
    Template Extraction Based on Menu Information
    • 2013