Corpus ID: 11589623

A Benchmark Suite for Template Detection and Content Extraction

  • Julián Alarte, David Insa, J. Silva, S. Tamarit
  • Published 2014
  • Computer Science
  • ArXiv
  • Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both analyze the structure and content of webpages to extract some part of the document, but their objectives differ. Template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. Therefore, they are…
    3 Citations

    Web content extraction based on maximum continuous sum of text density
    • Kai Sun, M. Li, +4 authors S. Fu
    • Computer Science
    • 2016 International Conference on Asian Language Processing (IALP)
    • 2016
    TeMex: The Web Template Extractor
    An Effective Method to Extract Web Content Information


    Using the words/leafs ratio in the DOM tree for content extraction
    Template Extraction Based on Menu Information
    • Proceedings of the 9th International Workshop on Automated Specification and Verification of Web Systems (WWV 13), Article 5
    • 2013