TABBIE: Pretrained Representations of Tabular Data

@inproceedings{Iida2021TABBIEPR,
  title={TABBIE: Pretrained Representations of Tabular Data},
  author={Hiroshi Iida and Dung Ngoc Thai and Varun Manjunatha and Mohit Iyyer},
  booktitle={NAACL},
  year={2021}
}
Existing work on tabular representation learning jointly models tables and associated text using self-supervised objective functions derived from pretrained language models such as BERT. While this joint pretraining improves tasks involving paired tables and text (e.g., answering questions about tables), we show that it underperforms on tasks that operate over tables without any associated text (e.g., populating missing cells). We devise a simple pretraining objective (corrupt cell detection)…
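
The corrupt cell detection objective can be pictured with a short sketch. The corruption rate, the use of a second table as the source of replacement cells, and the toy tables below are illustrative assumptions rather than TABBIE's exact recipe; the point is only that the model sees a corrupted table and is trained to flag, per cell, whether that cell was replaced.

import random

def corrupt_table(table, donor_table, corrupt_frac=0.15, seed=0):
    # Replace a fraction of cells with cells drawn from another table and
    # return the corrupted table plus a 0/1 grid marking which cells changed.
    rng = random.Random(seed)
    donor_cells = [cell for row in donor_table for cell in row]
    corrupted, labels = [], []
    for row in table:
        new_row, label_row = [], []
        for cell in row:
            if rng.random() < corrupt_frac:
                new_row.append(rng.choice(donor_cells))  # corrupted cell
                label_row.append(1)
            else:
                new_row.append(cell)                     # original cell
                label_row.append(0)
        corrupted.append(new_row)
        labels.append(label_row)
    return corrupted, labels

# A table encoder (not shown) would embed every cell of `corrupted` and be
# trained with per-cell binary cross-entropy against `labels`.
table = [["Country", "Capital"], ["France", "Paris"], ["Japan", "Tokyo"]]
donor = [["Team", "City"], ["Lakers", "Los Angeles"], ["Celtics", "Boston"]]
corrupted_table, cell_labels = corrupt_table(table, donor)
print(corrupted_table)
print(cell_labels)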

Citations

Table Pre-training: A Survey on Model Architectures, Pretraining Objectives, and Downstream Tasks
TLDR
This survey aims to provide a review of model designs, pre-training objectives, and downstream tasks for table pre-training, and to share thoughts on existing challenges and future opportunities.
Rows from Many Sources: Enriching row completions from Wikidata with a pre-trained Language Model
TLDR
This work presents state-of-the-art results for subject suggestion and gap filling measured on a standard benchmark (WikiTables), and synthesizes additional rows using free text generation via GPT-3.
DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Material Science Articles
A crucial component in the curation of a KB for a scientific domain is information extraction from tables in the domain's published articles – tables carry important information (often numeric), which…
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
TLDR
The proposed Language-Interfaced Fine-Tuning (LIFT) does not make any changes to the model architecture or loss function and relies solely on the natural language interface, enabling “no-code machine learning with LMs,” and performs relatively well across a wide range of low-dimensional classification and regression tasks.
Generation-focused Table-based Intermediate Pre-training for Free-form Question Answering
TLDR
An intermediate pre-training framework, Generation-focused Table-based Intermediate Pre-training (GENTAP), is presented that jointly learns representations of natural language questions and tables, enhancing question understanding and table representation abilities for complex questions.
Table Retrieval May Not Necessitate Table-specific Model Design
TLDR
This work performs an analysis on a table-based portion of the Natural Questions dataset, and finds that DPR performs well without any table-specific design and training, and even achieves superior results compared to DTR when tuned on properly linearized tables.
TransTab: Learning Transferable Tabular Transformers Across Tables
TLDR
The goal of TransTab is to convert each sample to a generalizable embedding vector and then apply stacked transformers for feature encoding; one key methodological insight is combining column descriptions and table cells as the raw input to a gated transformer model.
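
As a rough illustration of what "combining column descriptions and table cells as the raw input" can look like, the snippet below serializes one tabular sample into column-name/value phrases. The separator token, the phrasing, and row_to_text itself are assumptions made for illustration, not TransTab's actual preprocessing.

def row_to_text(columns, row):
    # Serialize one sample as "column is value" phrases joined by a separator.
    return " [SEP] ".join(f"{col} is {val}" for col, val in zip(columns, row))

columns = ["age", "blood pressure", "smoker"]
row = [63, "140/90", "yes"]
print(row_to_text(columns, row))
# age is 63 [SEP] blood pressure is 140/90 [SEP] smoker is yes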
XInfoTabS: Evaluating Multilingual Tabular Natural Language Inference
TLDR
This paper uses machine translation methods to construct a multilingual tabular NLI dataset, namely XINFOTABS, which expands the English tabular NLI dataset INFOTABS to ten diverse languages, and finds that the XINFOTABS evaluation suite is both practical and challenging.
Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning
TLDR
The task of Trustworthy Tabular Reasoning is introduced, where a model needs to extract evidence to be used for reasoning in addition to predicting the label, and a two-stage sequential prediction approach is proposed, consisting of an evidence extraction stage and an inference stage.
Robust (Controlled) Table-to-Text Generation with Structure-Aware Equivariance Learning
TLDR
This work proposes an equivariance learning framework, LATTICE, which encodes tables with a structure-aware self-attention mechanism and improves T5-based models on the ToTTo and HiTab benchmarks.
…

References

SHOWING 1-10 OF 37 REFERENCES
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
TLDR
TaBERT is a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables, achieving new best results on the challenging weakly supervised semantic parsing benchmark WikiTableQuestions while performing competitively on the text-to-SQL dataset Spider.
Table Structure Recognition using Top-Down and Bottom-Up Cues
TLDR
This work presents an approach for table structure recognition that combines cell detection and interaction modules to localize cells and predict their row and column associations with other detected cells, and incorporates structural constraints as additional differentiable components of the loss function for cell detection.
Deep Splitting and Merging for Table Structure Decomposition
TLDR
A pair of novel deep learning models (Split and Merge) is presented that, given an input image, predict the basic table grid pattern and which grid elements should be merged to recover cells that span multiple rows or columns.
Sato: Contextual Semantic Type Detection in Tables
TLDR
This work introduces Sato, a hybrid machine learning model that automatically detects the semantic types of columns in tables by exploiting signals from the table context as well as the column values, exceeding state-of-the-art performance by a significant margin.
Sherlock: A Deep Learning Approach to Semantic Data Type Detection
TLDR
Sherlock is introduced, a multi-input deep neural network for detecting semantic types that achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
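
For readers unfamiliar with the metric, the support-weighted F1 above is the per-class F1 averaged with weights proportional to each class's frequency; scikit-learn computes it directly. The toy label lists below are made up purely to show the call.

from sklearn.metrics import f1_score

# Per-class F1 averaged with weights equal to each class's support.
y_true = ["date", "name", "name", "city", "city", "city"]
y_pred = ["date", "name", "city", "city", "city", "name"]
print(f1_score(y_true, y_pred, average="weighted"))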
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
TURL: Table Understanding through Representation Learning
TLDR
This paper proposes a structure-aware Transformer encoder to model the row-column structure of relational tables, and presents a new Masked Entity Recovery objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data.
Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context
TLDR
This work presents Global Table Extractor (GTE), a vision-guided systematic framework for joint table detection and cell structure recognition that can be built on top of any object detection model, together with GTE-Cell, a new hierarchical cell detection network that leverages table styles.
TaPas: Weakly Supervised Table Parsing via Pre-training
TLDR
TaPas is presented, an approach to question answering over tables without generating logical forms; it outperforms or rivals semantic parsing models, improving state-of-the-art accuracy on SQA and performing on par with the state of the art on WikiSQL and WikiTQ, with a simpler model architecture.
…