• Corpus ID: 235422653

GitTables: A Large-Scale Corpus of Relational Tables

@article{Hulsebos2021GitTablesAL,
  title={GitTables: A Large-Scale Corpus of Relational Tables},
  author={Madelon Hulsebos and Çağatay Demiralp and Paul Groth},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.07258}
}
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables…
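
To make the intended use concrete, here is a minimal sketch of loading a single table from such a corpus with pandas and pyarrow. The Parquet file format, the file path, and the presence of key-value annotation metadata are assumptions made for illustration, not details taken from the abstract.

  import pandas as pd
  import pyarrow.parquet as pq

  # Hypothetical path to one table file; the Parquet format is an assumption.
  path = "gittables/example_table.parquet"

  # Load the table for downstream tasks such as column type detection.
  df = pd.read_parquet(path)
  print(df.dtypes)

  # If the corpus stores annotations in the Parquet key-value metadata,
  # they can be inspected like this.
  metadata = pq.read_schema(path).metadata or {}
  for key, value in metadata.items():
      print(key.decode(errors="replace"), value[:80])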

Citations

DAGOBAH: Table and Graph Contexts For Efficient Semantic Annotation Of Tabular Data
TLDR
The latest improvements of the DAGOBAH system that performs automatic pre-processing and semantic interpretation of tables are presented and promising results obtained in the SemTab 2021 challenge are reported.
Table Understanding in Practice
TLDR
The performance of table understanding models within an isolated context is illustrated, and it is observed that actual deployment beyond that context is still challenging for various reasons.
Making Table Understanding Work in Practice
TLDR
This paper discusses three challenges of deploying table understanding models and proposes a framework to address them, and presents SIGMATYPER which implements this framework for the semantic column type detection task.
GitSchemas: A Dataset for Automating Relational Data Preparation Tasks
TLDR
This paper presents GITSCHEMAS, a new dataset aimed at increasing the level of automation of data preparation tasks for relational data; it consists of schema metadata for almost 50k real-world databases collected from public GitHub repositories.
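
As a rough illustration of what schema metadata extracted from SQL files in public repositories can look like, the hedged sketch below pulls table and column names from CREATE TABLE statements; the SQL text is invented, and this is not the GITSCHEMAS pipeline, which presumably relies on a proper SQL parser.

  import re

  # Invented SQL text standing in for a .sql file found in a repository.
  sql = """
  CREATE TABLE users (id INT PRIMARY KEY, name TEXT, signup_date DATE);
  CREATE TABLE orders (id INT, user_id INT REFERENCES users(id), total NUMERIC);
  """

  # Naive extraction of table names and column names; a real pipeline
  # would use a full SQL parser instead of a regular expression.
  for name, body in re.findall(r"CREATE TABLE\s+(\w+)\s*\((.*?)\);", sql, re.S | re.I):
      columns = [column.strip().split()[0] for column in body.split(",")]
      print(name, columns)
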
SiMa: Effective and Efficient Data Silo Federation Using Graph Neural Networks
TLDR
SiMa is presented, a method for federating data silos that consistently finds more correct relationships than the state-of-the-art matching methods, while minimizing wrong predictions and requiring 20x to 1000x less time to execute.
Similarity-driven Schema Transformation for Test Data Generation
TLDR
A novel approach for similarity-driven generation of schemas is presented: it takes an arbitrary dataset as input, extracts its schema, and derives a set of output schemas from it, generating multiple schemas under user-defined heterogeneity constraints so that the process is configurable even for non-experts.
Table Pre-training: A Survey on Model Architectures, Pretraining Objectives, and Downstream Tasks
TLDR
This survey aims to provide a comprehensive review of model designs, pre-training objectives, and downstream tasks for table pre-training, and to share thoughts on existing challenges and future opportunities.
Results of SemTab 2021
SemTab 2021 was the third edition of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, successfully collocated with the 20th International Semantic Web Conference (ISWC) and the…
JenTab Meets SemTab 2021's New Challenges
TLDR
This paper re-designed the system architecture, optimized individual modules, and developed various pipelines to target specific challenges posed throughout the SemTab 2021 challenge, demonstrating JenTab's flexibility and its ability to quickly address new challenges.

References

SHOWING 1-10 OF 68 REFERENCES
VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository
TLDR
VizNet is presented: a large-scale corpus of over 31 million datasets compiled from open data repositories and online visualization galleries that provides the necessary common baseline for comparing visualization design techniques, and developing benchmark models and algorithms for automating visual analysis.
Characteristics of Open Data CSV Files
TLDR
This work analyzes an Open Data corpus containing 200K tabular resources with a total file size of 413 GB from a data consumer perspective; it inspects the general shape of this tabular data, reports on column and row distributions, and analyses the availability of header rows and whether a file contains multiple tables.
Methods for exploring and mining tables on Wikipedia
TLDR
This work presents WikiTables, a Web application that enables users to interactively explore tabular knowledge extracted from Wikipedia that substantially outperforms baselines on the novel task of automatically joining together disparate tables to uncover "interesting" relationships between table columns.
Sherlock: A Deep Learning Approach to Semantic Data Type Detection
TLDR
Sherlock is introduced, a multi-input deep neural network for detecting semantic types that achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
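
For context, a support-weighted F1 score averages per-type F1 scores weighted by how many true columns carry each semantic type. A minimal sketch with scikit-learn, using made-up labels rather than Sherlock's actual type vocabulary:

  from sklearn.metrics import f1_score

  # Toy ground-truth and predicted semantic column types (invented labels).
  y_true = ["name", "city", "city", "year", "name", "name"]
  y_pred = ["name", "city", "year", "year", "name", "city"]

  # average="weighted" weights each type's F1 by its support, i.e. the
  # number of true columns of that type.
  print(f1_score(y_true, y_pred, average="weighted"))
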
TURL: Table Understanding through Representation Learning
TLDR
This paper proposes a structure-aware Transformer encoder to model the row-column structure of relational tables, and presents a new Masked Entity Recovery objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data.
Universal Sentence Encoder for English
TLDR
Transfer learning using sentence-level embeddings is shown to outperform models without transfer learning and often those that use only word-level transfer.
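
As a small sketch of the sentence-level transfer described here, the snippet below embeds two short texts with the publicly released TF-Hub module and compares them with cosine similarity; the example strings, and their framing as column headers, are assumptions for illustration.

  import numpy as np
  import tensorflow_hub as hub

  # Load the published Universal Sentence Encoder module.
  embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

  # Embed two short texts, e.g. column headers that may be semantically related.
  a, b = embed(["release year", "date of publication"]).numpy()

  # Cosine similarity between the two embeddings.
  print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
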
Web table extraction, retrieval, and augmentation: A survey
  • ACM Transactions on Intelligent Systems and Technology (TIST), 11
  • 2020
Results of SemTab 2021
SemTab 2021 was the third edition of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, successfully collocated with the 20th International Semantic Web Conference (ISWC) and the…
Knowledge Graphs 2021: A Data Odyssey
  • G. Weikum
  • Proc. VLDB Endow.
  • 2021
TLDR
The role of "DB thinking" in building and maintaining high-quality knowledge bases from web contents and extracting quantitative measures of entities, from text and web tables, presents an opportunity to further enhance the scope and value of knowledge bases.
...
...