TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data

@article{Yin2020TaBERTPF,
  title={TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data},
  author={Pengcheng Yin and Graham Neubig and Wen-tau Yih and Sebastian Riedel},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.08314}
}
Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, and hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables.
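At a high level, TaBERT linearizes table content and encodes it jointly with the NL utterance using a BERT-style encoder. The snippet below is a minimal sketch of that joint-encoding idea only, written against the Hugging Face transformers API rather than the authors' released code; the utterance, table row, and exact linearization format are illustrative assumptions, and TaBERT's content snapshots and vertical self-attention are not reproduced here.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

utterance = "which nation has the largest gdp?"

# One table row, linearized as "column name | column type | cell value" cells,
# roughly following the per-row linearization described in the paper.
row = [("Nation", "text", "United States"), ("GDP", "real", "21,439,453")]
row_text = " ".join(f"{col} | {typ} | {val}" for col, typ, val in row)

# Joint input: [CLS] utterance [SEP] linearized row [SEP]
inputs = tokenizer(utterance, row_text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# Token-level representations spanning both the utterance and the table row.
# TaBERT itself would further pool cell-value tokens into per-column vectors
# and aggregate them across sampled rows ("content snapshots") with vertical
# self-attention; those steps are omitted in this sketch.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])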
Retrieving Knowledge in Tabular Form from Free-form Natural Language
  • 2021
Language models pretrained on table data perform well in various table-related tasks. However, depending on the language, it can be difficult to obtain a large amount of table data for pretraining.
Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
TLDR
A model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to produce pre-training data, mitigating issues of existing general-purpose language models.
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing
TLDR
GraPPa is an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data; it significantly outperforms RoBERTa-large as the feature representation layer and establishes new state-of-the-art results on all evaluated table semantic parsing tasks.
Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills
TLDR
This work proposes to leverage semi-structured tables to automatically generate, at scale, question-paragraph pairs where answering the question requires reasoning over multiple facts in the paragraph, and adds a pre-training step over this synthetic data, which includes examples requiring 16 different reasoning skills.
A Question Answering System for Unstructured Table Images
TLDR
This work presents a question answering system for unstructured table images that mainly consists of a table recognizer to recognize the tabular structure from an image and a table parser to generate the answer to a natural language question over the table.
Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text
TLDR
This work proposes a verbalizer-retriever-reader framework for open-domain QA over data and text, where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources, and shows that verbalized knowledge is preferred for answer reasoning in both adapted and hot-swap settings.
SDCUP: Schema Dependency-Enhanced Curriculum Pre-Training for Table Semantic Parsing
TLDR
Two novel pre-training objectives are designed to impose the desired inductive bias into the learned representations for table pre-training, and a schema-aware curriculum learning approach is proposed to mitigate the impact of noise and learn effectively from the pre-training data in an easy-to-hard manner.
Dual Reader-Parser on Hybrid Textual and Tabular Evidence for Open Domain Question Answering
TLDR
This paper proposes a hybrid framework that takes both textual and tabular evidence as input and generates either direct answers or SQL queries, depending on which form could better answer the question, and achieves state-of-the-art performance on the OpenSQuAD dataset using a T5-base model.
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
TLDR
This work presents BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing that effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
Multi-Instance Training for Question Answering Across Table and Linked Text
TLDR
MITQA, a new TextTableQA system that explicitly models the different but closely related probability spaces of table row selection and text span selection, is proposed, achieving a 21% absolute improvement in both EM and F1 scores over previously published results.

References

Showing 1-10 of 56 references
Neural Semantic Parsing with Type Constraints for Semi-Structured Tables
TLDR
A new semantic parsing model for answering compositional questions over semi-structured Wikipedia tables achieves state-of-the-art accuracy, and type constraints and entity linking are shown to be valuable components to incorporate in neural semantic parsers.
ERNIE: Enhanced Language Representation with Informative Entities
TLDR
This paper utilizes both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE) which can take full advantage of lexical, syntactic, and knowledge information simultaneously, and is comparable with the state-of-the-art model BERT on other common NLP tasks.
Semantic Parsing via Paraphrasing
TLDR
This paper presents two simple paraphrase models, an association model and a vector space model, and trains them jointly from question-answer pairs, improving state-of-the-art accuracies on two recently released question-answering datasets.
Global Reasoning over Database Structures for Text-to-SQL Parsing
TLDR
This work uses message-passing through a graph neural network to softly select a subset of database constants for the output query, conditioned on the question, and trains a model to rank queries based on the global alignment of database constants to question words.
Compositional Semantic Parsing on Semi-Structured Tables
TLDR
This paper proposes a logical-form driven parsing algorithm guided by strong typing constraints and shows that it obtains significant improvements over natural baselines and is made publicly available.
TabFact: A Large-scale Dataset for Table-based Fact Verification
TLDR
A large-scale dataset with 16k Wikipedia tables as evidence for 118k human-annotated natural language statements, labeled as either ENTAILED or REFUTED, is constructed, and two different models are designed: Table-BERT and the Latent Program Algorithm (LPA).
Learned in Translation: Contextualized Word Vectors
TLDR
Adding context vectors from a deep LSTM encoder of an attentional sequence-to-sequence model trained for machine translation improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks.
Knowledge Enhanced Contextual Word Representations
TLDR
After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, ability to recall facts as measured in a probing task and downstream performance on relationship extraction, entity typing, and word sense disambiguation.
Iterative Search for Weakly Supervised Semantic Parsing
TLDR
A novel iterative training algorithm is proposed that alternates between searching for consistent logical forms and maximizing the marginal likelihood of the retrieved ones, thus dealing with the problem of spuriousness.
Semantic Parsing on Freebase from Question-Answer Pairs
TLDR
This paper trains a semantic parser that scales up to Freebase and outperforms the previous state-of-the-art parser on the dataset of Cai and Yates (2013), despite not having annotated logical forms.