DART: Open-Domain Structured Data Record to Text Generation

@inproceedings{Radev2021DARTOS,
  title={DART: Open-Domain Structured Data Record to Text Generation},
  author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chia-Hsuan Hsieh and Nazneen Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher},
  booktitle={NAACL},
  year={2021}
}
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction… 
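To make the data format concrete, here is a minimal, hypothetical sketch in Python (not the released DART construction tooling) of how a single table row might be flattened into subject-predicate-object triples and paired with a reference sentence. The key names and the flat header-as-predicate mapping are illustrative assumptions; the procedure described in the abstract additionally exploits the ontology of semantic dependencies among column headers and the table title.

# Illustrative sketch (not the official DART construction code): flatten one
# table row into (subject, predicate, object) triples, using a designated
# subject column and the remaining column headers as predicates.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def row_to_triples(subject_column: str, row: Dict[str, str]) -> List[Triple]:
    subject = row[subject_column]
    return [(subject, header, value)
            for header, value in row.items()
            if header != subject_column]

row = {"Team": "Real Madrid", "Manager": "Carlo Ancelotti", "Stadium": "Santiago Bernabeu"}
instance = {
    # A DART-style instance pairs a triple set with reference sentence(s);
    # the key names here are assumptions, not the official schema.
    "tripleset": row_to_triples("Team", row),
    "annotations": [{"text": "Real Madrid, managed by Carlo Ancelotti, play at the Santiago Bernabeu."}],
}
print(instance)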
FeTaQA: Free-form Table Question Answering
TLDR
This work introduces FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} instances, and provides two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models.
Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
TLDR
A model pre-training framework, Generation-Augmented Pre-training (GAP), is proposed that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data, mitigating issues of existing general-purpose language models.
Generating Wikipedia Article Sections from Diverse Data Sources
TLDR
This work creates a large-scale dataset, WIKITABLET, that pairs Wikipedia sections with their corresponding tabular data and various metadata, and shows that the best approaches can generate fluent and high-quality texts but sometimes struggle with coherence.
The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation
TLDR
The CACAPO dataset is multilingual (Dutch and English) and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domains, together with aligned attribute-value paired data.
An Investigation of Fine-tuning Pre-trained Model for MR-to-Text Generation
  • Ting Hu, C. Meinel
  • 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), 2020
TLDR
Different methods to organize the MRs are explored, and it is shown that just linearizing the information in MRs achieves decent results, while the complex annotation process can be omitted.
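As a rough illustration of what "just linearizing" an MR can look like, a minimal Python sketch follows; the slot names are E2E-challenge-style examples assumed here, not taken from the paper.

# Hypothetical sketch: serialize the slot-value pairs of a meaning
# representation (MR) into a flat string that a pretrained seq2seq
# model can consume directly as input text.
def linearize_mr(mr: dict) -> str:
    return " | ".join(f"{slot} = {value}" for slot, value in mr.items())

mr = {"name": "The Eagle", "eatType": "coffee shop", "food": "French", "area": "riverside"}
print(linearize_mr(mr))
# name = The Eagle | eatType = coffee shop | food = French | area = riverside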
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
TLDR
GEM, a living benchmark for natural language generation (NLG), its evaluation, and metrics, is introduced, along with a description of the data for the 2021 shared task at the associated GEM Workshop.
Prefix-Tuning: Optimizing Continuous Prompts for Generation
TLDR
Prefix-tuning is proposed, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, called the prefix.
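A minimal sketch of the mechanism just described, assuming a PyTorch-style setup with a generic stand-in for the pretrained LM (not the authors' implementation): the base model's parameters are frozen and only a short sequence of continuous prefix vectors is trained and prepended to the input embeddings.

import torch
import torch.nn as nn

class PrefixTunedLM(nn.Module):
    def __init__(self, base_lm: nn.Module, embed_dim: int, prefix_len: int = 10):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():  # language model stays frozen
            p.requires_grad = False
        # the only trainable parameters: one continuous vector per prefix position
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim); prepend the learned prefix
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return self.base_lm(torch.cat([prefix, input_embeds], dim=1))

Note that the actual method injects prefix activations at every transformer layer rather than only at the embedding layer; the embedding-level version above is just the simplest variant of the idea.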
Do Fine-tuned Commonsense Language Models Really Generalize?
TLDR
Clear evidence is found that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup, and may, in fact, be susceptible to dataset bias.
Differentially Private Fine-tuning of Language Models
We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve state-of-the-art privacy versus utility tradeoffs.
Unifying Language Learning Paradigms
TLDR
A unified framework for pre-training models that are universally effective across datasets and setups is presented, and Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms, is proposed.

References

Showing 1-10 of 109 references
GTR-LSTM: A Triple Encoder for Sentence Generation from RDF Data
TLDR
This work proposes a system to translate a set of RDF triples into natural sentences based on an encoder-decoder framework that encodes not only the elements of the triple but also the relationships both within a triple and between the triples.
Triple-to-Text: Converting RDF Triples into High-Quality Natural Languages via Optimizing an Inverse KL Divergence
TLDR
This paper proposes a novel Triple-to-Text (T2T) framework, which approximately optimizes the inverse Kullback-Leibler (KL) divergence between the distributions of the real and generated sentences and demonstrates that T2T can generate higher-quality sentences and outperform baseline models in several evaluation metrics.
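For orientation, the direction of the divergence matters: standard maximum-likelihood training corresponds to the forward KL from the data distribution to the model, while the inverse direction mentioned above swaps the two arguments. In notation assumed here (not necessarily the paper's), with P_data the distribution of real sentences and P_theta the model distribution:

D_{\mathrm{KL}}\left(P_{\mathrm{data}} \,\|\, P_{\theta}\right)
  = \sum_{y} P_{\mathrm{data}}(y)\,\log\frac{P_{\mathrm{data}}(y)}{P_{\theta}(y)},
\qquad
D_{\mathrm{KL}}\left(P_{\theta} \,\|\, P_{\mathrm{data}}\right)
  = \sum_{y} P_{\theta}(y)\,\log\frac{P_{\theta}(y)}{P_{\mathrm{data}}(y)}.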
Table-to-Text: Describing Table Region With Natural Language
TLDR
A generative model is presented that produces a natural language sentence describing a table region (e.g., a row), together with a flexible copying mechanism that selectively replicates contents from the table in the output sequence.
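To illustrate what a copying mechanism of this kind typically computes, here is a standard pointer-generator-style mixture, given as an assumption rather than the paper's exact formulation: the output distribution mixes generating a word from the vocabulary with copying a token from the input table,

P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) \;+\; \left(1 - p_{\mathrm{gen}}\right) \sum_{i:\, x_i = w} a_i,

where a_i is the attention weight on input (table) token x_i and p_gen is a learned gate deciding between generating and copying.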
Creating Training Corpora for NLG Micro-Planners
TLDR
This paper proposes a corpus generation framework as a novel method for creating challenging datasets from which NLG models can be learned that are capable of handling the complex interactions occurring in micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation, and sentence segmentation.
Order-Planning Neural Text Generation From Structured Data
TLDR
This paper proposes an order-planning text generation model that captures the relationships between different fields and uses these relationships to make the generated text more fluent and smooth.
Logical Natural Language Generation from Open-Domain Tables
TLDR
A new NLG task where a model is tasked with generating natural language statements that can be logically entailed by the facts in an open-domain semi-structured table is suggested, and new automatic metrics to evaluate the fidelity of generation models w.r.t. logical inference are proposed.
Describing a Knowledge Base
TLDR
This work builds a generation framework based on a pointer network which can copy facts from the input KB, and adds two attention mechanisms: (i) slot-aware attention to capture the association between a slot type and its corresponding slot value; and (ii) a new table position self-attention to capture the inter-dependencies among related slots.
Few-Shot NLG with Pre-Trained Language Model
TLDR
This work proposes the new task of few-shot natural language generation and a simple yet effective approach that achieves very reasonable performance, outperforming the strongest baseline by an average of over 8.0 BLEU points.
Table-to-text Generation by Structure-aware Seq2seq Learning
TLDR
The attention visualizations and case studies show that the novel structure-aware seq2seq architecture, which consists of a field-gating encoder and a description generator with dual attention, is capable of generating coherent and informative descriptions based on a comprehensive understanding of both the content and the structure of a table.
Data-to-text Generation with Entity Modeling
TLDR
This work proposes an entity-centric neural architecture for data-to-text generation which creates entity-specific representations which are dynamically updated and outperforms competitive baselines in automatic and human evaluation.