Finding needles in a haystack: Sampling Structurally-diverse Training Sets from Synthetic Data for Compositional Generalization

  title={Finding needles in a haystack: Sampling Structurally-diverse Training Sets from Synthetic Data for Compositional Generalization},
  author={Inbar Oren and Jonathan Herzig and Jonathan Berant},
Modern semantic parsers suffer from two principal limitations. First, training requires expensive collection of utterance-program pairs. Second, semantic parsers fail to generalize at test time to new compositions/structures that have not been observed during training. Recent research has shown that automatic generation of synthetic utterance-program pairs can alleviate the first problem, but its potential for the second has thus far been under-explored. In this work, we investigate automatic… 
Learning to Generalize Compositionally by Transferring Across Semantic Parsing Tasks
This work investigates learning representations that facilitate transfer learning from one compositional task to another: the representation and the task-specific layers of the models are strategically trained differently on a pre-finetuning task such that they generalize well on mismatched splits that require compositionality.
Unobserved Local Structures Make Compositional Generalization Hard
A criterion for the difficulty of an example is proposed: a test instance is hard if it contains a local structure that was not observed at training time and it predicts instance-level generalization well across 5 different semantic parsing datasets, substantially better than alternative decision rules.
Dyna-bAbI: unlocking bAbI's potential with dynamic synthetic benchmarking
Dyna-bAbI is developed, a dynamic framework providing fine-grained control over task generation in bAbI, underscoring the importance of highly controllable task generators for creating robust NLU systems through a virtuous cycle of model and data development.
Paraphrasing Techniques for Maritime QA system
This paper investigates how to exploit paraphrasing methods for the automated generation of large-scale training datasets (in the form of paraphrased utterances and their corresponding logical forms in SQL format) and presents experimental results using real-world data in the maritime domain.
Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets
This work proposes a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs that leads to better generalization and uses information theory to show that reduction in spurious correlations between substructures may be one reason why diverse training sets improve generalization.
Compositional Generalization and Decomposition in Neural Program Synthesis
A suite of generalization tasks, which measure different types of compositional generalization that are desirable for program synthesis and are particularly difficult for current sequence to sequence models, are proposed.
  • 2022
Improving Compositional Generalization with Latent Structure and Data Augmentation
This work presents a more powerful data recombination method using a model called Compositional Structure Learner (CSL), a generative model with a quasi-synchronous context-free grammar backbone, which results in a model even stronger than a T5-CSL ensemble on two real world compositional generalization tasks.


Improving Compositional Generalization in Semantic Parsing
This work analyzes a wide variety of models and proposes multiple extensions to the attention module of the semantic parser, aiming to improve compositional generalization in semantic parsing, as output programs are constructed from sub-components.
Learning to Synthesize Data for Semantic Parsing
This work proposes a generative model which features a (non-neural) PCFG that models the composition of programs, and a BART-based translation model that maps a program to an utterance.
Compositional Generalization via Semantic Tagging
This work decomposes decoding into two phases where an input utterance is first tagged with semantic symbols representing the meanings of its individual words, and then a sequence-to-sequence model is used to predict the final meaning representation conditioning on the utterance and the predicted tag sequence.
Generating Synthetic Data for Task-Oriented Semantic Parsing with Hierarchical Representations
This work explores the possibility of generating synthetic data for neural semantic parsing using a pretrained denoising sequence-to-sequence model (i.e., BART) and uses an auxiliary parser (AP) to filter the generated utterances.
Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures
It is shown that masked language model (MLM) pre-training rivals SCAN-inspired architectures on primitive holdout splits and establishes a new state of the art on the CFQ compositional generalization benchmark using MLM pre- training together with an intermediate representation.
Meta-Learning to Compositionally Generalize
A meta-learning augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization is implemented, and Experimental results on the COGS and SCAN datasets show that this similarity-driven meta- learning can improve generalization performance.
Unlocking Compositional Generalization in Pre-trained Models Using Intermediate Representations
It is highlighted that intermediate representations provide an important and potentially overlooked degree of freedom for improving the compositional generalization abilities of pre-trained seq2seq models.
Good-Enough Compositional Data Augmentation
A simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models that reduces error rate by as much as 87% on diagnostic tasks from the SCAN dataset and 16% on a semantic parsing task.
*-CFQ: Analyzing the Scalability of Machine Learning on a Compositional Task
It is shown that compositional generalization remains a challenge at all training sizes, and that increasing the scope of natural language leads to consistently higher error rates, which are only partially offset by increased training data.
Compositional Generalization for Primitive Substitutions
This paper conducts fundamental research for encoding compositionality in neural networks with two representations, one generating attention maps, and the other mapping attended input words to output symbols to improve generalization.