Information Extraction from the Web by Matching Visual Presentation Patterns

@inproceedings{Burget2016InformationEF,
  title={Information Extraction from the Web by Matching Visual Presentation Patterns},
  author={Radek Burget},
  booktitle={KEKI/NLP\&DBpedia@ISWC},
  year={2016}
}
  • Radek Burget
  • Published in KEKI/NLP&DBpedia@ISWC 17 October 2016
  • Computer Science
The documents available on the World Wide Web contain large amounts of information presented in tables, lists, or other visually regular structures. However, the published information is usually not annotated explicitly or implicitly, and its interpretation is left to a human reader. This makes information extraction from web documents a challenging problem. Most existing methods follow a top-down approach that proceeds from larger page regions to individual data records, which…
Model-based Integration of Unstructured Web Data Sources using Graph Representation of Document Contents
TLDR
This paper proposes a graph-based model of the input document that allows the contained data to be interpreted in alternative ways, together with a method of aligning the document model with the target domain model by evaluating all possible mappings between the two models.
On the synthesis of metadata tags for HTML files
TLDR
This paper describes the design of a system that overcomes the previous limitations using a novel embedding approach, which has proven to outperform four state-of-the-art techniques on a repository of randomly selected HTML files from 40 different sites.
Segmentation of Dashboard Screen Images: Preparation of Inputs for Object-based Metrics of UI Quality
TLDR
This work analyzed the experience of 251 users manually segmenting dashboard screens and designed a novel method for the automatic segmentation of dashboard screen images, which processes the screen layout using a combination of top-down and bottom-up segmentation strategies.

References

Information Extraction from Web Sources Based on Multi-aspect Content Analysis
TLDR
This paper presents an information extraction approach based on analyzing the rendered pages rather than their code. The pages are represented by an RDF-based model that allows the results of different page analysis methods, such as layout analysis and visual and textual feature classification, to be combined.
Visually Extracting Data Records from Query Result Pages
TLDR
This work proposes a novel approach that makes use of the common sources of evidence that humans use to understand data records on a displayed query result page, and proposes new techniques that can identify each data record individually while ignoring noise items such as navigation bars and adverts.
Extracting Data Records from Query Result Pages Based on Visual Features
TLDR
This work proposes a novel approach that makes use of visual features and query terms to identify the data section and extract data records from it and uses several content and visual features of visual blocks in a data section to filter out noisy blocks.
Automatic annotation of online articles based on visual feature classification
TLDR
This paper presents a method for detecting interesting areas in a web page, inspired by the assumed approach of a human reader. It proposes a way of classifying blocks based on their visual features and provides an experimental evaluation of the method on real-world data.
Hierarchies in HTML documents: linking text to concepts
  • Radek Burget
  • Computer Science
    Proceedings. 15th International Workshop on Database and Expert Systems Applications, 2004.
  • 2004
TLDR
This paper proposes an approach based on modeling the visual part of the rendered document and describing the key characteristics of the data presentation in a general way. It also proposes a way of using this model to locate instances of the concepts in the document using approximate tree matching algorithms and regular expressions.
Information extraction for search engines using fast heuristic techniques
Data extraction from web pages based on structural-semantic entropy
TLDR
This paper presents an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. It has been successfully applied to detect false drug advertisements on the web, owing to its capacity to associate the attributes of records with their respective values.
Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages
Region based data extraction
TLDR
This paper proposes a novel visual-based wrapper that uses visual cues to eliminate unnecessary regions and hence reduces the running time of the extraction task, as the wrapper only needs to consider the relevant region for extraction.
Automatically Annotating Structured Web Data Using a SVM-Based Multiclass Classifier
TLDR
A new learning approach to Web data annotation is proposed, where a support vector machine-based multiclass classifier is trained to assign labels to data items and a data section re-segmentation algorithm based on visual and content features is introduced.