• Corpus ID: 9518100

Extracting the Main Content from HTML Documents

Samuel Louvan
A modern web document typically consists of many kinds of information. Besides the main content, which conveys the primary information, a web document also contains noisy content such as advertisements, headers, footers, decorations, copyright notices, and navigation menus. The presence of noisy content may degrade the performance of applications such as commercial search engines, web crawlers, and web miners. Therefore, extracting the main content from web documents and removing noisy content… 

To extract informative content from online web pages by using hybrid approach

  • Madhura R. Kaddu, R. Kulkarni
  • Computer Science
    2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT)
  • 2016
This work presents a hybrid approach for extracting main content from the web pages based on combination of automatic extraction and hand crafted rules techniques, in which automatic rules are created instead of manual hand crafted rule creation.

Web Page Segmentation and Informative Content Extraction for Effective Information Retrieval

The Effective Visual Block Extractor (EVBE) algorithm is proposed to overcome the problems of DOM-based approaches and reduce the drawbacks of previous work in web page segmentation; the algorithm's higher precision and recall can help increase the performance of web mining tasks.

A Survey of Web Information Extraction Tools

A comprehensive review is provided of the major Web IE tools that operate on Web text and represent web pages with the Document Object Model, in order to decide which Web IE tool to integrate into future work on Web text mining.

Informative Content Extraction By Using Eifce [Effective Informative Content Extractor]

This paper proposes the Effective Visual Block Extractor (EVBE) algorithm to overcome the problems of DOM-based approaches and reduce the drawbacks of previous work in web page segmentation.

Web Content Extraction by Weighing the Fundamental Contextual Rules

This research presents a new approach to extracting useful content from the Web: fundamental contextual rules are weighted and applied to the nodes of the DOM tree, and the best child of each parent node is selected according to a weighting algorithm.
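
One simple reading of the weighting idea (a toy sketch of my own, not the paper's actual rule set): score each DOM node by the amount of text it contains and repeatedly descend into the heaviest child while that child dominates its parent. Here nodes are plain tuples to keep the sketch self-contained; the `ratio` cutoff is an assumed parameter.

```python
# A toy DOM node: (tag, own_text, children). This is an illustrative
# stand-in for a real DOM, not the paper's data structure.

def weight(node):
    """Total text length in this node's subtree (a simple node weight)."""
    tag, text, children = node
    return len(text) + sum(weight(c) for c in children)

def best_node(node, ratio=0.7):
    """Descend into the heaviest child while it holds at least `ratio`
    of the parent's total weight; otherwise stop at the current node."""
    tag, text, children = node
    total = weight(node)
    if children:
        heaviest = max(children, key=weight)
        if total and weight(heaviest) / total >= ratio:
            return best_node(heaviest, ratio)
    return node
```

On a page whose `<article>` subtree holds most of the text, this descent skips past `<body>` and short navigation siblings and stops at the article node.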

The Development of Medicinal Plants Database for Use in Primary Health Care from Various Herbal Websites

Extraction achieved precision, recall, and F-measure above 95%, and the plant data extracted from those websites was stored in a relational database.

Language independent content extraction from web pages

This method creates a text-density graph of a given web page and then selects the region of the page with the highest density; evaluated on a standard dataset, it is comparable to or better than state-of-the-art methods that are computationally more complex.
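
A minimal sketch of the text-density idea (my own simplification, not the paper's algorithm): score each HTML line by the ratio of visible text to total characters, then find the contiguous run of lines whose density exceeds a threshold by the largest total margin, using the maximum-subarray (Kadane) trick. The `threshold` value is an assumed parameter.

```python
import re

def line_density(line: str) -> float:
    """Ratio of visible text to total characters on one HTML line."""
    text = re.sub(r"<[^>]*>", "", line)   # strip tags
    return len(text.strip()) / max(len(line), 1)

def densest_region(html: str, threshold: float = 0.5):
    """Return (start, end) line indices of the contiguous run of lines
    whose density exceeds `threshold` by the largest total margin
    (Kadane's maximum-subarray scan over density - threshold)."""
    lines = html.splitlines()
    scores = [line_density(l) - threshold for l in lines]
    best_sum = float("-inf")
    best_span = (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_start, cur_sum = i, s
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, i + 1)
    return best_span
```

Tag-heavy navigation lines score below the threshold, so the selected span settles on the run of text-rich lines in the middle of the page.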

DOM-based content extraction of HTML documents

This work develops a framework that employs an easily extensible set of techniques incorporating the advantages of previous work on content extraction; the key insight is to work with DOM trees rather than with raw HTML markup.

Automating Content Extraction of HTML Documents

A framework is developed that employs an easily extensible set of techniques to work with DOM trees (a W3C-specified interface that allows programs to dynamically access document structure) rather than with raw HTML markup.
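
To illustrate working with parsed structure rather than raw markup, here is a minimal stdlib sketch (not the authors' framework) that streams HTML through Python's `html.parser`, skips subtrees whose tags are usually noise, and keeps the remaining text. The `SKIP` tag set is an assumption for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text while skipping subtrees that are usually noise."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Because the parser tracks nesting depth, everything inside `<nav>` or `<script>` is dropped as a unit, which is exactly the advantage of structural processing over regex-style markup stripping.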

Automatic extraction of informative blocks from webpages

Two new algorithms, ContentExtractor and FeatureExtractor, are proposed that identify primary content blocks with high precision and recall, reduce the storage requirement for search engines, result in smaller indexes and thereby faster search times, and better user satisfaction.

Discovering informative content blocks from Web documents

By adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications, retrieval and extraction precision is increased, while index size and extraction complexity are reduced.

Text Extraction from the Web via Text-to-Tag Ratio

  • Tim Weninger, W. Hsu
  • Computer Science
    2008 19th International Workshop on Database and Expert Systems Applications
  • 2008
This work describes how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas; the approach shows surprisingly high recall at all levels of precision, along with large space savings.
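
The per-line text-to-tag ratio can be sketched in a few lines (a simplified take on the idea; a fixed threshold stands in for the paper's clustering step, and its value is an assumption):

```python
import re

def text_to_tag_ratio(line: str) -> float:
    """Characters of text remaining after tag removal, divided by the
    number of tags on the line (treated as 1 if the line has no tags)."""
    tags = re.findall(r"<[^>]*>", line)
    text = re.sub(r"<[^>]*>", "", line)
    return len(text.strip()) / max(len(tags), 1)

def content_lines(html: str, threshold: float = 10.0):
    """Classify a line as content when its ratio exceeds `threshold`
    (a fixed cutoff standing in for the paper's clustering step)."""
    return [i for i, line in enumerate(html.splitlines())
            if text_to_tag_ratio(line) > threshold]
```

Menu lines packed with anchor tags yield low ratios, while prose-heavy lines with few or no tags yield high ones, so a simple cutoff already separates the two regimes.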

Extracting context to improve accuracy for HTML content extraction

A new technique is presented that incrementally clusters websites using search-engine snippets to associate a newly requested website with a particular "genre" and then applies settings previously determined to be appropriate for that genre, dramatically improving content extraction results overall.

Editorial: special issue on web content mining

This special issue focuses on Web content mining. Web mining consists of Web usage mining, Web structure mining, and Web content mining; the last aims to extract or mine useful information and knowledge from Web page contents.

Learning block importance models for web pages

This paper uses a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure, then spatial features and content features are extracted and used to construct a feature vector for each block.

QuASM: a system for question answering using semi-structured data

A system for question answering using semi-structured metadata, QuASM (pronounced "chasm"), which aims to answer factual questions by exploiting the structure inherent in documents found on the World Wide Web.