Joint optimization of wrapper generation and template detection

@inproceedings{Zheng2007JointOO,
  title={Joint optimization of wrapper generation and template detection},
  author={Shuyi Zheng and Ruihua Song and Ji-rong Wen and Di Wu},
  booktitle={KDD '07},
  year={2007}
}
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. The data of a category are typically encoded into similar pages by a common script or template. In recent years, some value-added services, such as comparison shopping and vertical search in a specific domain, have motivated the research of extraction technologies with high accuracy. Almost all previous works assume that input pages of a wrapper induction system conform to a… 

A Newfangled Template Extraction from Heterogeneous Web Pages using IEPAD
TLDR
A novel algorithm is proposed to improve the efficiency, accuracy, and scalability of template extraction from heterogeneous web pages; the size of each cluster is based on the number of paths produced by the input documents.
What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors
TLDR
This work implemented and evaluated five of the most advanced template extractors in the literature and implemented a workbench, which can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.
AUTOMATIC TEMPLATE EXTRACTION FROM HETEROGENEOUS WEB PAGES
TLDR
A novel goodness measure with a fast approximation for clustering is proposed, and a comprehensive analysis confirms the effectiveness and robustness of the algorithm compared with state-of-the-art template detection algorithms.
TEXT: Automatic Template Extraction from Heterogeneous Web Pages
TLDR
Novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates are presented and a novel goodness measure with its fast approximation for clustering and comprehensive analysis of the algorithm are provided.
Automatic Template Extraction using Hyper Graph Technique from Heterogeneous Web Pages
TLDR
Novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates are presented and a novel goodness measure with its fast approximation for clustering and comprehensive analysis of the algorithm are developed.
A study on template extraction
  • S. Pushpa, D. Kanagalatchumy
  • Computer Science
    2013 International Conference on Information Communication and Embedded Systems (ICICES)
  • 2013
TLDR
This paper surveys some of the algorithms for efficiently extracting templates from different web pages and presents different techniques for fast and accurate template extraction.
TEMPLATE EXTRACTION FROM HETEROGENEOUS WEB PAGES
TLDR
To effectively manage an unknown number of clusters (templates), the Minimum Description Length (MDL) principle is used, and the MinHash technique estimates the MDL cost quickly so that qualified clusters are formed.
Template Independent Object extraction using MDL and MinHash Techniques
TLDR
A SAX parser can provide a good partitioning technique compared to a DOM parser representation in terms of performance and scalability, and the performance of the template detection process can be shown together with the clustering process.

References

Fully automatic wrapper generation for search engines
TLDR
A technique for automatically producing wrappers that can be used to extract search result records from dynamically generated result pages returned by search engines, and experimental results indicate that this technique can achieve very high extraction accuracy.
Extracting structured data from Web pages
TLDR
This paper presents an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages.
Wrapping-oriented classification of web pages
TLDR
Given a portion of a Web site to wrap, techniques are presented to cluster its HTML pages into page classes with homogeneous organization and layout, which can become the input to the wrapper generation process.
Interactive wrapper generation with minimal user effort
TLDR
The goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms.
A hierarchical approach to wrapper induction
TLDR
This work introduces an inductive algorithm, STALKER, that generates high accuracy extraction rules based on user-labeled training examples that can handle information sources that could not be wrapped by existing techniques.
Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web
IEPAD: information extraction based on pattern discovery
TLDR
IEPAD is proposed, a system that automatically discovers extraction rules from Web pages; it identifies record boundaries by repeated pattern mining and multiple sequence alignment and can achieve 97 percent extraction over fourteen popular search engines.
A brief survey of web data extraction tools
TLDR
A taxonomy for characterizing Web data extraction tools is proposed, major Web data extraction tools described in the literature are briefly surveyed, and a qualitative analysis of them is provided.
A flexible learning system for wrapping tables and lists in HTML documents
TLDR
A wrapper-learning system called WL2 that can exploit several different representations of a document, including DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page and representations of the visual appearance of text as it will be rendered.
A two-phase rule generation and optimization approach for wrapper generation
TLDR
A novel two-phase rule generation and optimization (2P-RULE) approach for wrapper generation that is suitable for extracting information from web pages with complex nested structure, and can also achieve better precision and recall ratio.