ToTTo: A Controlled Table-To-Text Generation Dataset

@article{Parikh2020ToTToAC,
  title={ToTTo: A Controlled Table-To-Text Generation Dataset},
  author={Ankur P. Parikh and Xuezhi Wang and Sebastian Gehrmann and Manaal Faruqui and Bhuwan Dhingra and Diyi Yang and Dipanjan Das},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.14373}
}
We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and… 
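To illustrate the controlled generation setup described in the abstract, the sketch below shows one made-up example in a ToTTo-style layout and a simple way to flatten it into model input. The field names, example content, and linearization are assumptions for clarity, not the dataset's official schema or preprocessing.

# A minimal sketch of the controlled table-to-text task: a table plus highlighted
# cells as input, a single faithful sentence as the target.
# All field names and values here are hypothetical, not ToTTo's exact format.
example = {
    "table_page_title": "Example Athlete",
    "table_section_title": "International competitions",
    "table": [  # table as a list of rows of plain-text cells
        ["Year", "Competition", "Position", "Event"],
        ["1992", "World Junior Championships", "1st", "4x100 m relay"],
    ],
    "highlighted_cells": [(1, 0), (1, 1), (1, 2), (1, 3)],  # (row, column) indices
    "sentence": "Example Athlete won the 4x100 m relay at the 1992 World Junior Championships.",
}

def linearize(ex):
    """Flatten the page/section titles and the highlighted cells into one source
    string, a common way to feed such examples to a sequence-to-sequence model."""
    cells = [ex["table"][r][c] for r, c in ex["highlighted_cells"]]
    return " | ".join([ex["table_page_title"], ex["table_section_title"], " ; ".join(cells)])

print(linearize(example))
# Example Athlete | International competitions | 1992 ; World Junior Championships ; 1st ; 4x100 m relay

A controlled setup like this constrains the model to describe only the highlighted cells, which is what makes faithfulness to the source table measurable.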
Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
TLDR
This work proposes MFβ, a decomposable metric built on two pillars, one of which measures the linguistic quality of the generated text, and shows that fulfilling both principles offers benefits for AMR-to-text evaluation, including explainability of scores.
Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
TLDR
This work proposes MFβ, an automatic metric that builds on a principle of meaning preservation, which measures to what extent the original AMR graph can be reconstructed from the generated sentence, and implements it using SOTA language models.
Text-to-Text Pre-Training for Data-to-Text Tasks
TLDR
It is indicated that text-to-text pre-training in the form of T5 enables simple, end-to-end transformer-based models to outperform pipelined neural architectures tailored for data-to-text generation, as well as alternatives such as BERT and GPT-2.
DART: Open-Domain Structured Data Record to Text Generation
TLDR
The dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing.
Input-Tuning: Adapting Unfamiliar Inputs to Frozen Pretrained Models
TLDR
It is argued that one of the factors hindering the development of prompt-tuning on NLG tasks is unfamiliar inputs, which motivates input-tuning as a more effective way to adapt unfamiliar inputs to frozen PLMs.
Controlling hallucinations at word level in data-to-text generation
TLDR
A finer-grained approach to hallucinations, arguing that hallucinations should rather be treated at the word level, is proposed, which is able to reduce and control hallucinations, while keeping fluency and coherence in generated texts.
Towards Generating Financial Reports From Table Data Using Transformers
TLDR
A transformer network is implemented to solve the task of generating matching pairs between tables and sentences found in financial documents, achieving promising results, with the final model reaching a BLEU score of 63.3.
PLOG: Table-to-Logic Pretraining for Logical Table-to-Text Generation
TLDR
On two benchmarks, LogicNLG and ContLog, PLOG outperforms strong baselines by a large margin on logical fidelity, demonstrating the effectiveness of table-to-logic pretraining.
R2D2: Robust Data-to-Text with Replacement Detection
TLDR
R2D2 is introduced, a training framework that addresses unfaithful Data-to-Text generation by training a system both as a generator and a faithfulness discriminator with additional replacement detection and unlikelihood learning tasks.
How Do Seq2Seq Models Perform on End-to-End Data-to-Text Generation?
TLDR
Annotation of the outputs of five models on four datasets with eight error types finds that the copy mechanism helps reduce Omission and Inaccuracy Extrinsic errors but increases other error types such as Addition.
...
...

References

Showing 1-10 of 51 references
Data-to-Text Generation with Content Selection and Planning
TLDR
This work presents a neural network architecture that incorporates content selection and planning without sacrificing end-to-end training, and shows that the model outperforms strong baselines, improving the state of the art on the recently released RotoWire dataset.
Challenges in Data-to-Document Generation
TLDR
A new, large-scale corpus of data records paired with descriptive documents is introduced, a series of extractive evaluation methods for analyzing performance are proposed, and baseline results are obtained using current neural generation methods.
Creating Training Corpora for NLG Micro-Planners
TLDR
This paper proposes the corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation, and sentence segmentation.
The E2E Dataset: New Challenges For End-to-End Generation
TLDR
The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection, which promises more natural, varied and less template-like system utterances.
Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation
TLDR
This work proposes a novel confidence oriented decoder that assigns a confidence score to each target position in training using a variational Bayes objective, and can be leveraged at inference time using a calibration technique to promote more faithful generation.
Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
TLDR
A new metric, PARENT, is proposed, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall, and is applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.
Neural Text Generation from Structured Data with Application to the Biography Domain
TLDR
A neural model for concept-to-text generation is introduced that scales to large, rich domains and significantly outperforms a classical Kneser-Ney language model adapted to this task by nearly 15 BLEU.
Get To The Point: Summarization with Pointer-Generator Networks
TLDR
A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
Table-to-text Generation by Structure-aware Seq2seq Learning
TLDR
The attention visualizations and case studies show that the novel structure-aware seq2seq architecture, which consists of a field-gating encoder and a description generator with dual attention, is capable of generating coherent and informative descriptions based on a comprehensive understanding of both the content and the structure of a table.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
...
...