Extraction of Product Specifications from the Web - Going Beyond Tables and Lists

@article{Gangadhar2022ExtractionOP,
  title={Extraction of Product Specifications from the Web - Going Beyond Tables and Lists},
  author={Govind Krishnan Gangadhar and Ashish Kulkarni},
  journal={5th Joint International Conference on Data Science \& Management of Data (9th ACM IKDD CODS and 27th COMAD)},
  year={2022}
}
E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications like product catalogue curation, search, question answering, and others. However, across different websites, there is a wide variety of HTML elements typically used to render these blocks, which makes their automatic extraction a challenge. Most of the current research has focused on…
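To make the attribute-value extraction task concrete, here is a minimal Python sketch of a naive baseline. It is not the method of the paper: it only handles two-cell table rows and `dt`/`dd` pairs, which is the narrow case the title says the paper goes beyond. The class name `SpecPairParser`, the function `extract_spec_pairs`, and the sample markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class SpecPairParser(HTMLParser):
    """Collect (attribute, value) pairs from two-cell table rows and dt/dd pairs.

    Hypothetical baseline for illustration; it ignores div/span-based layouts.
    """

    CELL_TAGS = {"td", "th", "dt", "dd"}

    def __init__(self):
        super().__init__()
        self.pairs = []          # extracted (attribute, value) tuples
        self._cells = []         # cell texts seen in the current table row
        self._buffer = None      # text chunks of the cell being read, if any
        self._pending_dt = None  # last <dt> text still waiting for its <dd>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in self.CELL_TAGS:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell-like element.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.CELL_TAGS and self._buffer is not None:
            # Join the chunks and normalise whitespace.
            text = " ".join("".join(self._buffer).split())
            self._buffer = None
            if tag in ("td", "th"):
                self._cells.append(text)
            elif tag == "dt":
                self._pending_dt = text
            elif tag == "dd" and self._pending_dt is not None:
                self.pairs.append((self._pending_dt, text))
                self._pending_dt = None
        elif tag == "tr":
            # Treat a row as a specification only if it has exactly
            # two non-empty cells (attribute name, attribute value).
            if len(self._cells) == 2 and all(self._cells):
                self.pairs.append(tuple(self._cells))
            self._cells = []


def extract_spec_pairs(html):
    """Return the attribute-value pairs found in an HTML string."""
    parser = SpecPairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = (
    "<table><tr><th>Weight</th><td>1.2 kg</td></tr>"
    "<tr><td colspan='2'>General</td></tr></table>"
    "<dl><dt>Color</dt><dd>Black</dd></dl>"
)
print(extract_spec_pairs(sample))
```

A page that renders the same block with nested `div` or `span` elements yields nothing from this baseline, which illustrates the variety-of-markup problem the abstract describes.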

An Efficient Mechanism for Deep Web Data Extraction Based on Tree-Structured Web Pattern Matching

For effective web data extraction across a large number of online pages, a unique representation of page generation using tree-based pattern matching (TBPM) is proposed, and its performance is compared with that of existing techniques in terms of relativity, precision, recall, and time consumption.

References

Extracting attribute-value pairs from product specifications on the web

This paper presents an approach for extracting attribute-value pairs from product specifications on the Web that uses supervised learning to classify the HTML tables and HTML lists within a web page as product specifications or not, and reports the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops.

DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In

An Adaptive Faceted Search Interface for Structured Product Offers on the Web

This paper presents an adaptive faceted search interface over product offers in RDF that does not rely on a rigid conceptual schema with hardwired product features, thereby being suitable for arbitrary product domains and product evolution.

Synthesizing Products for Online Catalogs

This paper proposes a system that provides an end-to-end solution to the product synthesis problem, and addresses issues involved in data extraction from offers, schema reconciliation, and data fusion.

Matching unstructured product offers to structured product specifications

The heart of this system is a data-driven component that learns the matching function offline and then applies it at run time to match offers to products. The system is used to match all the offers received by Bing Shopping to the Bing product catalog.

Towards domain-independent information extraction from web tables

This paper shifts attention from the tree-based representation of webpages to a variation of the two-dimensional visual box model used by web browsers to display information on the screen, and the authors believe that this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web".

Learning to Extract Attribute Value from Product via Question Answering: A Multi-task Approach

This work proposes a novel approach for Attribute Value Extraction via Question Answering (AVEQA) using a multi-task framework which treats each attribute as a question and identifies the answer span corresponding to the attribute value in the product context.

Uncovering the Relational Web

This paper gives an in-depth study of the Web's HTML table corpus, and describes a system for performing relation recovery that achieves precision and recall that are comparable to other domain-independent information extraction systems.

WebTables: exploring the power of tables on the web

The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.

Building an operational product ontology system