Optimizing Deeper Transformers on Small Datasets

@inproceedings{Xu2021OptimizingDT,
  title={Optimizing Deeper Transformers on Small Datasets},
  author={Peng Xu and Dhruv Kumar and Wei Yang and Wenjie Zi and Keyi Tang and Chenyang Huang and Jackie Chi Kit Cheung and Simon Prince and Yanshuai Cao},
  booktitle={ACL},
  year={2021}
}
It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, practitioners usually add only shallow and simple layers on top of pre-trained models during fine-tuning. This work shows that this need not always be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading…
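The abstract attributes the gains to "proper initialization and optimization" but does not spell the scheme out in the excerpt above. Purely as an illustration of depth-dependent weight scaling, not the authors' actual rule, the PyTorch sketch below Xavier-initializes each layer's weight matrices and then shrinks them by a factor that decays with depth; the depth_scaled_init_ helper and the num_layers ** -0.25 factor are hypothetical choices in the spirit of the depth-scaled initializers cited in the references further down.

import torch
import torch.nn as nn

def depth_scaled_init_(layers, num_layers: int) -> None:
    """Hypothetical depth-dependent initialization (illustration only).

    Xavier-initialize every weight matrix, then shrink it by a factor
    that decays with model depth, so deeper stacks start with smaller
    effective updates. The num_layers ** -0.25 factor is a placeholder,
    not the rule derived in the paper.
    """
    scale = num_layers ** -0.25
    for layer in layers:
        for param in layer.parameters():
            if param.dim() > 1:  # weight matrices only; skip biases and gains
                nn.init.xavier_uniform_(param)
                with torch.no_grad():
                    param.mul_(scale)

# Example: a 24-layer encoder stack, as might sit on top of pre-trained features.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    for _ in range(24)
)
depth_scaled_init_(layers, num_layers=24)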

Citations

DeepNet: Scaling Transformers to 1,000 Layers
TLDR
A new normalization function (DEEPNORM) is introduced to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization, which successfully scales Transformers up to 1,000 layers, one order of magnitude deeper than previous deep Transformers.
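For concreteness, here is a minimal sketch of the modified residual connection this TLDR describes, assuming a standard PyTorch setup. The DeepNormResidual class name is mine, and alpha is passed in as a plain constant rather than computed from the depth-dependent formula the DeepNet paper derives.

import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """DEEPNORM-style residual block: LayerNorm(alpha * x + sublayer(x)).

    alpha up-weights the residual stream relative to the sublayer output;
    DeepNet derives its value (and a companion weight-scaling factor) from
    the number of layers, which is not reproduced here.
    """
    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = alpha
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))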
Hierarchical Neural Data Synthesis for Semantic Parsing
TLDR
This work proposes a purely neural approach of data augmentation for semantic parsing that completely removes the need for grammar engineering while achieving higher semantic parsing accuracy on the Spider cross-domain text-to-SQL semantic parsing benchmark.
mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer
TLDR
This work adapts the RAT-SQL+GAP system to rely on a multilingual BART model and produces a translated version of the Spider dataset, which can help other researchers obtain Machine Learning results in languages other than English.
PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models
TLDR
On the challenging Spider and CoSQL text-to-SQL translation tasks, it is shown that PICARD transforms fine-tuned T5 models with passable performance into state-of-the-art solutions.
HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing
TLDR
This work proposes a History Information Enhanced text-to-SQL model (HIE-SQL) to exploit context dependence information from both history utterances and the last predicted SQL query, and proposes a bimodal pre-trained model to bridge the gap between them.
LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations
TLDR
This work proposes a Line Graph Enhanced Text-to-SQL (LGESQL) model to mine the underlying relational features without constructing meta-paths, and designs an auxiliary task called graph pruning to improve the discriminative capability of the encoder.
Transformers in Time-series Analysis: A Tutorial
TLDR
The core components of the Transformer, including the self-attention mechanism, positional encoding, multi-head attention, and the encoder/decoder structure, are explained.
SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL
TLDR
In SADGA, the graph structure is adopted to provide a unified encoding model for both the natural language question and database schema and a structure-aware aggregation method is devised to learn the mapping between the question-graph and schema-graph.
Zero-shot Code-Mixed Offensive Span Identification through Rationale Extraction
TLDR
This paper investigates the effectiveness of sentence-level transformers for zero-shot offensive span identification on a code-mixed Tamil dataset and finds both LIME and IG to show significant improvement with Masked Data Augmentation and Multilabel Training.
A Globally Normalized Neural Model for Semantic Parsing
TLDR
This paper proposes a globally normalized model for context-free grammar (CFG)-based semantic parsing that predicts a real-valued score at each step and does not suffer from the label bias problem.
...

References

Showing 1-10 of 34 references
Improving Transformer Optimization Through Better Initialization
TLDR
This work investigates and empirically validates the source of optimization problems in the encoder-decoder Transformer architecture, and proposes a new weight initialization scheme with theoretical justification that enables training without warmup or layer normalization and achieves leading accuracy.
Understanding the Difficulty of Training Transformers
TLDR
It is revealed that, for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations and results in significant disturbances in the model output, yet a light dependency limits the potential of model training and can lead to an inferior trained model.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Transformers without Tears: Improving the Normalization of Self-Attention
TLDR
It is shown that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates, and l2 normalization with a single scale parameter (SCALENORM) is proposed for faster training and better performance.
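A minimal sketch of the two ingredients this TLDR names, assuming a standard PyTorch setup: SCALENORM as l2 normalization with a single learned scale, and a PRENORM-style residual wrapper. The class names and the eps guard are my own; the paper suggests initializing the scale near sqrt(d_model).

import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """l2 normalization with a single learned scale g: g * x / ||x||_2."""
    def __init__(self, scale: float, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(float(scale)))
        self.eps = eps  # guard against division by zero (my addition)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm

class PreNormResidual(nn.Module):
    """PRENORM-style residual connection: x + sublayer(norm(x))."""
    def __init__(self, sublayer: nn.Module, norm: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = norm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

# Example: wrap a feed-forward sublayer with ScaleNorm in the pre-norm position.
d_model = 256
ffn = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, d_model))
block = PreNormResidual(ffn, ScaleNorm(scale=d_model ** 0.5))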
Learning Deep Transformer Models for Machine Translation
TLDR
It is claimed that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next.
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing
TLDR
GraPPa is an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data, significantly outperforms RoBERTa-large as the feature representation layers, and establishes new state-of-the-art results on the evaluated benchmarks.
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
TLDR
Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU, while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.
Lipschitz Constrained Parameter Initialization for Deep Transformers
TLDR
This paper empirically demonstrates that a simple modification made in the official implementation, which changes the computation order of the residual connection and layer normalization, can significantly ease the optimization of deep Transformers, and presents a parameter initialization method that leverages a Lipschitz constraint on Transformer parameters to effectively ensure training convergence.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
TLDR
A model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data, mitigating issues of existing general-purpose language models.
...