Ten Years of WebTables

@article{Cafarella2018TenYO,
  title={Ten Years of WebTables},
  author={Michael J. Cafarella and Alon Y. Halevy and Hongrae Lee and Jayant Madhavan and Cong Yu and Daisy Zhe Wang and Eugene Wu},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={2140--2149}
}
In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activities around the WebTables project itself, as well as the broad topic of informal online structured data. In this paper, we will review the WebTables project, and try to place it in the broader context of the decade of work that followed. We will also show how the progress… 

Structured Object Matching across Web Page Revisions
TLDR
This work presents novel techniques that match tables, infoboxes and lists within a page across page revisions and is able to extract the evolution of structured information in various forms from a long series of web page revisions.
Dataset search: a survey
TLDR
This work surveys the state of the art of research and commercial systems and discusses what makes dataset search a field in its own right, with unique challenges and open questions, and looks at approaches and implementations from related areas dataset search is drawing upon.
A Novel Approach to Data Extraction on Hyperlinked Webpages
TLDR
15,000 web pages were downloaded using an in-house web crawler, and a nondeterministic finite automaton algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables that could support better and stronger queries using the join operation.
The Secret Life of Wikipedia Tables
TLDR
This empirical paper has extracted, matched and analyzed the entire history of all 3.5M tables on the English Wikipedia for a total of 53.8M table versions to provide various analysis results, such as statistics about lineage sizes, table positions, volatility, change intervals, schema changes, and their editors.
CWTs: A Public Large-Scale Chinese Web Table Data Set
TLDR
The results show that the data set can provide a preliminary data basis for table data query, question answering systems, and knowledge base construction, and at the same time provides an optional channel for building a large-scale machine-understandable Chinese knowledge base.
From web-tables to a knowledge graph: prospects of an end-to-end solution
TLDR
Unlike general-purpose text mining and web-scraping tools, this work aims at developing a solution that takes into account the relational nature of the information represented in web-tables.
Natural Key Discovery in Wikipedia Tables
TLDR
This work formally defines the notion of natural keys and proposes a supervised learning approach that automatically detects natural keys in Wikipedia tables using carefully engineered features; it achieves 80% F-measure, at least 20% more than all related approaches.
Collocating News Articles with Structured Web Tables
TLDR
This paper uses the content and entities extracted from news articles and their matching tables to fine-tune a Bidirectional Transformers (BERT) model, and achieves near 90% accuracy@5 as opposed to baselines varying between 30% and 64%.
Extracting Contextualized Quantity Facts from Web Tables
TLDR
A novel method for automatically extracting quantity facts from ad-hoc web tables by recognizing quantities, with normalized values and units, aligning them with the proper entities, and contextualizing these pairs with informative cues to match sophisticated queries with modifiers.
Web-scale Knowledge Collection
TLDR
This tutorial presents approaches for Information Extraction from Web data that can be differentiated along two key dimensions: the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and the thrust to develop scalable approaches with zero to limited human supervision.

References

SHOWING 1-10 OF 42 REFERENCES
Applying WebTables in Practice
TLDR
This work identifies the main challenges in finding tables that are likely to contain high-quality data and in recovering the semantics of these tables or signals that hint at their semantics; the result is a semantically enriched table corpus that was used to develop several services.
WebTables: exploring the power of tables on the web
TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Uncovering the Relational Web
TLDR
This paper gives an in-depth study of the Web's HTML table corpus, and describes a system for performing relation recovery that achieves precision and recall that are comparable to other domain-independent information extraction systems.
Data Integration for the Relational Web
TLDR
Octopus is a system that combines search, extraction, data cleaning, and integration, and enables users to create new data sets from those found on the Web by offering a set of best-effort operators that automate the most labor-intensive tasks.
Exhibit: lightweight structured data publishing
TLDR
Exhibit is a lightweight framework for publishing structured data on standard web servers that requires no installation, database administration, or programming and makes that data more useful to all of its consumers.
Answering Table Queries on the Web using Column Keywords
TLDR
The design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns is presented and a novel query segmentation model for matching keywords to table columns is defined.
Synthesizing Union Tables from the Web
TLDR
This paper defines the notion of stitchable tables and identifies collections of tables that can be stitched, and designs an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning values of those attributes across tables to synthesize new columns.
A Large Public Corpus of Web Tables containing Time and Context Metadata
TLDR
A large public corpus of Web tables which contains over 233 million tables and has been extracted from the July 2015 version of the CommonCrawl is presented to provide a common ground for evaluating Web table systems.
Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems
TLDR
This work proposes a framework based on functional dependencies (FDs) that generates FDs by counting-based algorithms over many data sources, extends the FDs with probabilities to capture their inherent uncertainties, and solves two problems to improve data and schema quality in a pay-as-you-go system.
TEGRA: Table Extraction by Global Record Alignment
TLDR
This work addresses the important problem of automatically extracting multi-column relational tables from data in "list" form, and develops an efficient 2-approximation algorithm that considerably outperforms the state-of-the-art approaches in terms of quality.