Recovering Semantics of Tables on the Web

@article{Venetis2011RecoveringSO,
  title={Recovering Semantics of Tables on the Web},
  author={Petros Venetis and Alon Y. Halevy and Jayant Madhavan and Marius Pasca and Warren Shen and Fei Wu and Gengxin Miao and Chung Wu},
  journal={Proc. VLDB Endow.},
  year={2011},
  volume={4},
  pages={528--538}
}
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To recover semantics of tables, we leverage a database… 

Recovering the Semantics of Tabular Web Data
TLDR
This thesis conducts a comprehensive analysis of tables available on the Web to examine their characteristic features, and identifies unique challenges that these characteristics raise in the table understanding process.
Entity discovery and annotation in tables
TLDR
An algorithm is described that identifies the rows of a table that contain information on entities of specific types derived from an ontology and determines the cells in which the names of those entities occur and is trained to look for information on previously unseen entities on the Web so as to annotate them with the correct type.
Web table column categorisation and profiling
TLDR
A set of features which goes beyond probabilistic functional dependencies by using the union of multiple tables from the same web site and from different web sites to overcome the problem that single web tables are too small for the reliable calculation of functional dependencies.
Profiling the semantics of n-ary web table data
TLDR
This paper analyses a corpus of 5 million web tables originating from 80 thousand different web sites with respect to how many web table attributes are non-binary, what composite keys are required to correctly interpret the semantics of the non-binary attributes, and whether the values of these keys are found in the table itself or need to be extracted from the page surrounding the table.
From Web Tables to Concepts: A Semantic Normalization Approach
TLDR
This paper proposes a normalization approach to decompose multi-concept tables into smaller single-concept tables, and utilizes the table schema as well as intrinsic data correlations to identify concept boundaries and split the tables accordingly.
Annotating Web Tables Using Surface Text Patterns
TLDR
This work develops a 2-stage framework where candidate patterns are generated based on sliding windows over texts in the first stage, and in the second stage, patterns are generalized and the redundant patterns are removed.
Towards Annotating Relational Data on the Web with Language Models
TLDR
The alternative described here does not require entity linking and relies instead on ranking relations using generative language models derived from Web-scale corpora, which can produce quality results even when the entities in the table are missing in the KG.
Annotating Web Tables through Knowledge Bases: A Context-Based Approach
TLDR
This paper presents two novel and unsupervised Web table annotation methods, which leverage the context of the tables to better capture and disambiguate their semantics.
TabEL: Entity Linking in Web Tables
TLDR
TabEL differs from previous work by weakening the assumption that the semantics of a table can be mapped to pre-defined types and relations found in the target KB, and enforces soft constraints in the form of a graphical model that assigns higher likelihood to sets of entities that tend to co-occur in Wikipedia documents and tables.
Web Table Annotation Using Knowledge Base
TLDR
This master's thesis describes how to automatically link the structured information in Web tables to the Knowledge Base using lookup-based methods as a scalable solution, and uses word embeddings to understand the semantics of the Web table and provide the corresponding row annotations.

References

SHOWING 1-10 OF 33 REFERENCES
Harvesting relational tables from lists on the web
TLDR
This work proposes a novel technique for extracting tables from lists that is domain independent and operates in a fully unsupervised manner, and believes that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the web.
Annotating and searching web tables using entities, types and relationships
TLDR
This paper proposes new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express, and a new graphical model for making all these labeling decisions for each table simultaneously.
WebTables: exploring the power of tables on the web
TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Answering Table Augmentation Queries from Unstructured Lists on the Web
TLDR
Modifications to statistical record segmentation models are proposed, and novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision are presented.
Uncovering the Relational Web
TLDR
This paper gives an in-depth study of the Web's HTML table corpus, and describes a system for performing relation recovery that achieves precision and recall that are comparable to other domain-independent information extraction systems.
Data Integration for the Relational Web
TLDR
Octopus is a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web, to offer the user a set of best-effort operators that automate the most labor-intensive tasks.
Open Information Extraction from the Web
TLDR
Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input, is introduced.
The Tradeoffs Between Open and Traditional Relation Extraction
TLDR
A new model for Open IE called O-CRF is presented and it is shown that it achieves increased precision and nearly double the recall of the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system.
Yago: a core of semantic knowledge
TLDR
YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts, which includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE).
Automatic Set Instance Extraction using the Web
TLDR
This paper presents a system named ASIA (Automatic Set Instance Acquirer), which takes in the name of a semantic class as input and automatically outputs its instances and shows excellent performance on several English-language benchmarks, thus demonstrating language-independence.