# ToTTo: A Controlled Table-To-Text Generation Dataset

```bibtex
@article{Parikh2020ToTToAC,
  title={ToTTo: A Controlled Table-To-Text Generation Dataset},
  author={Ankur P. Parikh and Xuezhi Wang and Sebastian Gehrmann and Manaal Faruqui and Bhuwan Dhingra and Diyi Yang and Dipanjan Das},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.14373}
}
```
• Published 29 April 2020
• Computer Science
• ArXiv
We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and…
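The controlled generation task can be pictured with a small, hand-made example. The field names below (`table`, `highlighted_cells`, `sentence_annotations`) are illustrative approximations of the released JSON schema, not an exact copy of it:

```python
# A toy example in the spirit of ToTTo's format: a table, a set of
# highlighted (row, column) cells, and a one-sentence target description.
example = {
    "table_page_title": "1986 FIFA World Cup",
    "table_section_title": "Final",
    "table": [  # rows of cells; each cell is a dict with a "value"
        [{"value": "Team"}, {"value": "Goals"}],
        [{"value": "Argentina"}, {"value": "3"}],
        [{"value": "West Germany"}, {"value": "2"}],
    ],
    # (row, column) indices of the cells the annotator highlighted
    "highlighted_cells": [(1, 0), (1, 1), (2, 0), (2, 1)],
    "sentence_annotations": [
        {"final_sentence": "Argentina beat West Germany 3-2 in the final."}
    ],
}

def highlighted_values(ex):
    """Collect the string values of the highlighted cells."""
    return [ex["table"][r][c]["value"] for r, c in ex["highlighted_cells"]]

print(highlighted_values(example))  # ['Argentina', '3', 'West Germany', '2']
```

A model is expected to verbalize only the highlighted cells, with the page and section titles available as context.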
114 Citations

## Citations

Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
• Computer Science
EACL
• 2021
This work proposes $\mathcal{M}_\beta$, a decomposable metric built on two pillars that measure the linguistic quality of the generated text, and shows that fulfilling both principles benefits AMR-to-text evaluation, including explainability of scores.
It is indicated that text-to-text pre-training in the form of T5 enables simple, end-to-end transformer-based models to outperform pipelined neural architectures tailored for data-to-text generation, as well as alternatives such as BERT and GPT-2.
DART: Open-Domain Structured Data Record to Text Generation
• Computer Science
NAACL
• 2021
The dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing.
Input-Tuning: Adapting Unfamiliar Inputs to Frozen Pretrained Models
• Computer Science
ArXiv
• 2022
It is argued that one of the factors hindering the development of prompt-tuning on NLG tasks is the unfamiliar inputs, leading to a more effective way to adapt unfamiliar inputs to frozen PLMs.
Controlling hallucinations at word level in data-to-text generation
• Computer Science
Data Min. Knowl. Discov.
• 2022
A finer-grained approach is proposed, arguing that hallucinations should be treated at the word level; it reduces and controls hallucinations while preserving fluency and coherence in generated texts.
Towards Generating Financial Reports From Table Data Using Transformers
• Computer Science
• 2021
A transformer network is implemented to generate matching pairs between tables and sentences found in financial documents, achieving promising results, with the final model reaching a BLEU score of 63.3.
PLOG: Table-to-Logic Pretraining for Logical Table-to-Text Generation
• Computer Science
ArXiv
• 2022
On two benchmarks, LogicNLG and ContLog, PLOG outperforms strong baselines by a large margin on logical fidelity, demonstrating the effectiveness of table-to-logic pretraining.
R2D2: Robust Data-to-Text with Replacement Detection
• Computer Science
ArXiv
• 2022
R2D2 is introduced, a training framework that addresses unfaithful Data-to-Text generation by training a system both as a generator and a faithfulness discriminator with additional replacement detection and unlikelihood learning tasks.
How Do Seq2Seq Models Perform on End-to-End Data-to-Text Generation?
• Computer Science
ACL
• 2022
Annotation of the outputs of five models on four datasets with eight error types finds that the copy mechanism helps reduce Omission and Inaccuracy Extrinsic errors but increases other error types such as Addition.

## References

Showing 1-10 of 51 references
Data-to-Text Generation with Content Selection and Planning
• Computer Science
AAAI
• 2019
This work presents a neural network architecture that incorporates content selection and planning without sacrificing end-to-end training, and shows that this model outperforms strong baselines, improving the state of the art on the recently released RotoWire dataset.
Challenges in Data-to-Document Generation
• Computer Science
EMNLP
• 2017
A new, large-scale corpus of data records paired with descriptive documents is introduced, a series of extractive evaluation methods for analyzing performance are proposed, and baseline results are obtained using current neural generation methods.
Creating Training Corpora for NLG Micro-Planners
This paper proposes the corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation, and sentence segmentation.
The E2E Dataset: New Challenges For End-to-End Generation
• Computer Science
SIGDIAL Conference
• 2017
The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection, which promises more natural, varied and less template-like system utterances.
Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation
• Ran Tian, Shashi Narayan
• Computer Science
ArXiv
• 2019
This work proposes a novel confidence oriented decoder that assigns a confidence score to each target position in training using a variational Bayes objective, and can be leveraged at inference time using a calibration technique to promote more faithful generation.
Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
• Computer Science
ACL
• 2019
A new metric is proposed, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall, and is applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.
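The core idea behind PARENT can be sketched at the unigram level. This is a deliberately simplified illustration under stated assumptions: the actual metric works over n-grams, uses soft entailment probabilities against the table, and includes a recall term over both the reference and the table.

```python
from collections import Counter

def toy_parent_precision(generated, reference, table_values):
    """Unigram-only sketch of PARENT-style precision: a generated token
    counts as correct if it appears in the reference OR is supported by
    the table (the real metric uses n-grams and soft entailment)."""
    gen = generated.split()
    ref = Counter(reference.split())
    table = set(table_values)
    supported = sum(1 for w in gen if ref[w] > 0 or w in table)
    return supported / len(gen)

# "2-0" is absent from the divergent reference but supported by the
# table, so it is not penalized the way plain n-gram precision would.
p = toy_parent_precision(
    "Arsenal won 2-0",
    "Arsenal defeated Chelsea",
    ["Arsenal", "Chelsea", "2-0"],
)
```

Here `p` is 2/3: "Arsenal" is matched by the reference, "2-0" by the table, and only "won" is unsupported, which is exactly the robustness to divergent references the metric is designed for.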
Neural Text Generation from Structured Data with Application to the Biography Domain
• Computer Science
EMNLP
• 2016
A neural model for concept-to-text generation is introduced that scales to large, rich domains and significantly outperforms a classical Kneser-Ney language model adapted to this task by nearly 15 BLEU.
Get To The Point: Summarization with Pointer-Generator Networks
• Computer Science
ACL
• 2017
A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
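The pointer-generator's copy step can be sketched as a mixture of two distributions at each decoding position. This is a minimal NumPy sketch under stated assumptions: the names and shapes are illustrative, and in the real model `p_gen`, the attention weights, and the vocabulary distribution are all produced jointly by the trained network.

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attention, src_ids, vocab_size):
    """Mix the generator's vocabulary distribution with a copy
    distribution built by scattering attention weights onto the
    source tokens' vocabulary ids (one decoding step, simplified)."""
    copy_dist = np.zeros(vocab_size)
    np.add.at(copy_dist, src_ids, attention)  # accumulates repeated tokens
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist

vocab_size = 5
vocab_dist = np.array([0.1, 0.2, 0.3, 0.4, 0.0])  # softmax over vocabulary
attention = np.array([0.5, 0.5])                   # over two source tokens
src_ids = np.array([4, 4])                         # both map to word id 4
out = final_distribution(0.7, vocab_dist, attention, src_ids, vocab_size)
```

Because both distributions sum to one, the mixture does too; word id 4 receives probability mass only through copying, which is how rare source words stay reproducible.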
Table-to-text Generation by Structure-aware Seq2seq Learning
• Computer Science
AAAI
• 2018
The attention visualizations and case studies show that the novel structure-aware seq2seq architecture which consists of field-gating encoder and description generator with dual attention is capable of generating coherent and informative descriptions based on the comprehensive understanding of both the content and the structure of a table.
Annotation Artifacts in Natural Language Inference Data
• Computer Science
NAACL
• 2018
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.