• Corpus ID: 38160065

A Proof-of-Concept of D³ Record Mining using Domain-Dependent Data

Yeong Su Lee, Michaela Geierhos, Sa-Kwang Song, Hanmin Jung
Our purpose is to perform data record extraction from online event calendars, exploiting sublanguage and domain characteristics. [...] One of the most remarkable advantages of our method is that it does not require any additional classification steps based on machine learning algorithms or keyword extraction methods; it is a so-called one-step mining technique. Another important criterion is that our system is robust to DOM and layout modifications made by web designers. Thus, preliminary…


Efficient record-level wrapper induction
This work proposes a record-level wrapper system that uses a novel "broom" structure to represent both records and generated wrappers and is able to effectively extract records and identify their internal semantics at the same time.
Web data extraction based on partial tree alignment
Experimental results using a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately.
Vision-based Web Data Records Extraction
This paper proposes a novel, language-independent technique for extracting data records from the response pages returned by web databases or search engines; results indicate that this vision-based approach can achieve very high extraction accuracy.
Mining Data Regions from Web Pages
This paper proposes a much more effective automatic technique that is able to mine non-contiguous data records and can correctly identify data regions, irrespective of the type of tag in which they are bound.
Mining data records in Web pages
The experimental results show that the proposed technique outperforms existing techniques substantially, and is able to mine both contiguous and non-contiguous data records.
Visual Clue Based Extraction of Web Data from Flat and Nested Data Records
This paper proposes a novel and effective technique for extracting data items from the nested and flat records of given web pages and shows that it is effective and better than existing techniques.
Extracting structured data from Web pages
This paper presents an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages.
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
A novel technique to compare HTML pages and generate a wrapper based on their similarities and differences is developed, which confirms the feasibility of the approach on real-life data-intensive Web sites.
Extracting Content Structure for Web Pages Based on Visual Representation
This paper presents an automatic top-down, tag-tree independent approach to detect web content structure that simulates how a user understands web layout structure based on his visual perception.
Efficient approaches for record level web information extraction systems
  • International Journal of Advanced Engineering & Application 2(1)
  • 2011