The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers

R. Csordás, Kazuki Irie, Jürgen Schmidhuber
Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic… 
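As a minimal illustration of what "scaling of embeddings" can refer to, the sketch below multiplies a token embedding by the square root of the model dimension, the convention used in the original Transformer. The function name `scale_token_embedding` is hypothetical, and the truncated abstract above does not say which scaling scheme the authors recommend, so this shows only one common option.

```python
import math


def scale_token_embedding(vector, d_model=None):
    """Scale a token embedding by sqrt(d_model).

    Assumption: d_model defaults to the length of the vector. This is the
    upscaling convention of the original Transformer, shown for illustration
    only; it is not necessarily the configuration the paper favors.
    """
    d = d_model if d_model is not None else len(vector)
    factor = math.sqrt(d)
    return [factor * x for x in vector]


if __name__ == "__main__":
    # A 4-dimensional one-hot embedding is scaled by sqrt(4) = 2.
    print(scale_token_embedding([1.0, 0.0, 0.0, 0.0]))
```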

Learning Adaptive Control Flow in Transformers for Improved Systematic Generalization

The novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the compositional table lookup task, and its attention and gating patterns tend to be interpretable as an intuitive form of neural routing.

The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization

The novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on a simple arithmetic task and a new variant of ListOps testing for generalization across computational depths.

Improving Baselines in the Wild

This study focuses on two datasets: iWildCam and FMoW, and shows that conducting separate cross-validation for each evaluation metric is crucial for both datasets.

From SCAN to Real Data: Systematic Generalization via Meaningful Learning

This paper revisits systematic generalization from the perspective of meaningful learning, an exceptional capability of humans to learn new concepts by connecting them with other previously known knowledge, and proposes to augment a training dataset in either an inductive or deductive manner to build semantic links between new and old concepts.

Iterative Decoding for Compositional Generalization in Transformers

This paper introduces iterative decoding, an alternative to seq2seq that improves transformer compositional generalization in the PCFG and Cartesian product datasets, and provides evidence that, in these datasets, seq2seq transformers do not learn iterations that are not unrolled.

Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing

Limits of current techniques for effectively leveraging model scale for compositional generalization are highlighted, while the analysis also suggests promising directions for future work.

Compositional generalization in semantic parsing with pretrained transformers

It is shown that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization in these benchmarks, compared with models trained from scratch, even though both benchmarks are English-based.

Improving Compositional Generalization with Latent Structure and Data Augmentation

This work presents a more powerful data recombination method using a model called Compositional Structure Learner (CSL), a generative model with a quasi-synchronous context-free grammar backbone, which results in a model even stronger than a T5-CSL ensemble on two real-world compositional generalization tasks.

Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets

This work proposes a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs that leads to better generalization and uses information theory to show that reduction in spurious correlations between substructures may be one reason why diverse training sets improve generalization.

Systematic Generalization with Edge Transformers

The Edge Transformer is a new model that combines inspiration from Transformers and rule-based symbolic AI and outperforms Relation-aware, Universal, and classical Transformer baselines on compositional generalization benchmarks in relational reasoning, semantic parsing, and dependency parsing.



CLOSURE: Assessing Systematic Generalization of CLEVR Models

Surprisingly, it is found that an explicitly compositional Neural Module Network model also generalizes badly on CLOSURE, even when it has access to the ground-truth programs at test time.

Improving Transformer Optimization Through Better Initialization

This work investigates and empirically validates the source of optimization problems in the encoder-decoder Transformer architecture; it proposes a new weight initialization scheme with theoretical justification that enables training without warmup or layer normalization and achieves leading accuracy.

GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training

GradInit is an automated and architecture agnostic method for initializing neural networks based on a simple heuristic; the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.

Measuring Generalization and Overfitting in Machine Learning

This work suggests that the true concern for robust machine learning is distribution shift rather than overfitting, and designing models that still work reliably in dynamic environments is a challenging but necessary undertaking.

Transcoding Compositionally: Using Attention to Find More Generalizable Solutions

This paper presents seq2attn, a new architecture that is specifically designed to exploit attention to find compositional patterns in the input, and exhibits overgeneralization to a larger degree than a standard sequence-to-sequence model.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
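A minimal sketch of the two ideas summarized above, in plain Python. The helper names (`prelu`, `he_std`, `he_init`) are hypothetical; the formulas assume the usual statement of the method, in which PReLU is f(x) = x for x > 0 and a·x otherwise, and weights are drawn from a zero-mean Gaussian with variance 2 / ((1 + a²) · fan_in).

```python
import math
import random


def prelu(x, a=0.25):
    """Parametric ReLU: identity for positive inputs, slope `a` otherwise.

    In a real network `a` is a learned parameter; 0.25 is only a starting value.
    """
    return x if x > 0 else a * x


def he_std(fan_in, a=0.0):
    """Standard deviation for rectifier-aware initialization.

    With a = 0 (plain ReLU) this reduces to sqrt(2 / fan_in).
    """
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))


def he_init(fan_in, fan_out, a=0.0, seed=0):
    """Return a fan_out x fan_in weight matrix sampled from N(0, he_std^2)."""
    rng = random.Random(seed)
    std = he_std(fan_in, a)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


if __name__ == "__main__":
    print(prelu(-2.0), prelu(3.0))
    print(he_std(8))
```

Dividing by (1 + a²) accounts for the extra signal passed through the negative side, so the variance of activations stays roughly constant from layer to layer.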

Universal Transformers

The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.

Memorize or generalize? Searching for a compositional RNN in a haystack

This paper proposes the lookup table composition domain as a simple setup to test compositional behaviour and shows that it is theoretically possible for a standard RNN to learn to behave compositionally in this domain when trained with standard gradient descent and provided with additional supervision.
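The lookup table composition domain can be sketched as follows. This is an illustrative reconstruction, not the authors' data generator: it assumes each table is a random bijection over fixed-length bit strings and that a composed task applies the named tables one after another. The names `make_tables` and `apply_composition`, and the default sizes, are hypothetical.

```python
import itertools
import random


def make_tables(num_tables=8, bits=3, seed=0):
    """Build `num_tables` random bijections over all bit strings of length `bits`."""
    rng = random.Random(seed)
    domain = ["".join(p) for p in itertools.product("01", repeat=bits)]
    tables = {}
    for i in range(num_tables):
        image = domain[:]
        rng.shuffle(image)  # a permutation of the domain, hence a bijection
        tables[f"t{i + 1}"] = dict(zip(domain, image))
    return tables


def apply_composition(tables, names, x):
    """Apply the tables listed in `names` to `x`, in the order given."""
    for name in names:
        x = tables[name][x]
    return x


if __name__ == "__main__":
    tables = make_tables()
    # Apply t1 first, then t2, to the input "000".
    print(apply_composition(tables, ["t1", "t2"], "000"))
```

A model that behaves compositionally should handle unseen sequences of table names, including longer ones, once it has learned each individual table.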

Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU, while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms and dispenses with recurrence and convolutions entirely. It generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.