Memory-Efficient Differentiable Transformer Architecture Search

Yuekai Zhao, Li Dong, Yelong Shen, Zhihua Zhang, Furu Wei, Weizhu Chen
Differentiable architecture search (DARTS) has been successfully applied to many vision tasks. However, directly applying DARTS to Transformers is memory-intensive, which renders the search process infeasible. To this end, we propose a multi-split reversible network and combine it with DARTS. Specifically, we devise a backpropagation-with-reconstruction algorithm so that only the last layer's outputs need to be stored. Relieving the memory burden of DARTS allows us to search with larger hidden…
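The reconstruction idea can be sketched with a RevNet-style two-split coupling (a minimal plain-Python illustration, not the paper's multi-split formulation; `F` and `G` are arbitrary stand-in sub-functions): because each block's inputs are exactly recoverable from its outputs, only the final layer's activations need to be stored, and earlier activations are rebuilt on the fly during the backward pass.

```python
# Minimal sketch of a two-split reversible block (RevNet-style coupling).
# F and G are hypothetical sub-functions; any deterministic function works.

def rev_forward(x1, x2, F, G):
    # Forward coupling: each half is updated using the other half.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Exact inversion: recompute inputs from outputs alone,
    # so intermediate activations never need to be stored.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

F = lambda v: 3.0 * v
G = lambda v: v * v

y1, y2 = rev_forward(2.0, 5.0, F, G)
x1, x2 = rev_inverse(y1, y2, F, G)  # recovers (2.0, 5.0) exactly
```

Stacking such blocks keeps memory constant in depth, which is what makes a memory-hungry DARTS-style search over Transformer layers tractable.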

Figures and Tables from this paper

Related Papers

A Survey of Transformers
This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.
Multi-head or Single-head? An Empirical Comparison for Transformer Training
It is shown that, with recent advances in deep learning, training of the 384-layer Transformer can be successfully stabilized, so that training difficulty is no longer a bottleneck, and a substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.
RankNAS: Efficient Neural Architecture Search by Pairwise Ranking
  • Chi Hu, Chenglong Wang, +5 authors Changliang Li
  • Computer Science
  • 2021
This paper addresses the efficiency challenge of Neural Architecture Search (NAS) by formulating the task as a ranking problem. Previous methods require numerous training examples to estimate the…
The Evolved Transformer
The Progressive Dynamic Hurdles method is developed, which dynamically allocates more resources to more promising candidate models on the computationally expensive WMT 2014 English-German translation task, and demonstrates consistent improvement over the Transformer on four well-established language tasks.
DARTS: Differentiable Architecture Search
The proposed algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques.
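The continuous relaxation at the heart of DARTS can be sketched as a softmax-weighted mixture of candidate operations on an edge (a toy scalar version; the `ops` below are illustrative stand-ins, not the actual search space):

```python
import math

def softmax(alphas):
    # Numerically stable softmax over architecture parameters.
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_op(x, ops, alphas):
    # DARTS-style mixed operation: a convex combination of all candidate
    # ops, so the architecture weights alphas are differentiable.
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))

ops = [lambda x: 0.0,       # "zero" op
       lambda x: x,         # identity / skip connection
       lambda x: 2.0 * x]   # stand-in for a parametric op
out = mixed_op(1.0, ops, [0.0, 0.0, 0.0])  # equal weights -> average of op outputs
```

After the search converges, the op with the largest alpha on each edge is kept; note that evaluating *all* candidate ops per edge is exactly what makes vanilla DARTS memory-intensive for Transformers.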
PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search
This paper presents a novel approach, namely Partially-Connected DARTS, which samples a small part of the super-net to reduce redundancy in exploring the network space, thereby performing a more efficient search without compromising performance.
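The partial-channel idea can be sketched as follows (a simplified scalar-list version; real PC-DARTS samples and shuffles convolutional channels, which this toy omits):

```python
def partial_channel(x, op, K=4):
    # PC-DARTS-style partial connection: only a sampled 1/K fraction of
    # channels passes through the (expensive) mixed operation; the rest
    # bypass it unchanged, cutting memory and compute by roughly K.
    split = len(x) // K
    processed = [op(v) for v in x[:split]]  # searched fraction
    bypass = list(x[split:])                # untouched channels
    return processed + bypass

out = partial_channel([1.0, 2.0, 3.0, 4.0], lambda v: 10.0 * v, K=4)
# only the first 1/4 of "channels" is transformed: [10.0, 2.0, 3.0, 4.0]
```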
Progressive Differentiable Architecture Search: Bridging the Depth Gap Between Search and Evaluation
This paper presents an efficient algorithm that allows the depth of searched architectures to grow gradually during training, and addresses the two resulting issues, heavier computational overhead and weaker search stability, via search space approximation and regularization.
DSNAS: Direct Neural Architecture Search Without Parameter Retraining
  • Shou-Yong Hu, Sirui Xie, +4 authors Dahua Lin
  • Computer Science, Mathematics
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
DSNAS is proposed, an efficient differentiable NAS framework that simultaneously optimizes architecture and parameters with a low-biased Monte Carlo estimate and successfully discovers networks with comparable accuracy on ImageNet.
Efficient Neural Architecture Search via Proximal Iterations
Different from DARTS, NASP reformulates the search process as an optimization problem with a constraint that only one operation is allowed to be updated during forward and backward propagation, and proposes a new algorithm inspired by proximal iterations to solve it.
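The one-operation-per-edge constraint can be illustrated with a toy projection step (an assumption-laden simplification of NASP's proximal iteration, using a hypothetical gradient on one edge's architecture weights):

```python
def project_one_hot(alphas):
    # Projection onto the constraint set: exactly one op active per edge.
    k = max(range(len(alphas)), key=lambda i: alphas[i])
    return [1.0 if i == k else 0.0 for i in range(len(alphas))]

def proximal_step(alphas, grads, lr=0.1):
    # One proximal-style iteration: continuous gradient update,
    # then project back to a discrete (one-hot) architecture.
    updated = [a - lr * g for a, g in zip(alphas, grads)]
    return project_one_hot(updated)

discrete = proximal_step([0.4, 0.3, 0.3], grads=[0.5, -0.2, 0.1])
# the largest updated weight (index 0 here) becomes the single active op
```

Because only the selected op participates in forward/backward propagation, both memory and computation per step drop relative to evaluating the full mixture as in DARTS.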
Reversible Architectures for Arbitrarily Deep Residual Neural Networks
From this interpretation, a theoretical framework on the stability and reversibility of deep neural networks is developed, and three reversible neural network architectures that can in theory go arbitrarily deep are derived.
SMASH: One-Shot Model Architecture Search through HyperNetworks
A technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture is proposed, achieving competitive performance with similarly-sized hand-designed networks.
Practical Block-Wise Neural Network Architecture Generation
A block-wise network generation pipeline called BlockQNN is proposed, which automatically builds high-performance networks using the Q-learning paradigm with an epsilon-greedy exploration strategy, and offers a tremendous reduction of the search space in designing networks, requiring only 3 days on 32 GPUs.
ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding
This paper performs the differentiable search on a compressed lower-dimensional space that has the same validation loss as the original sparse solution space, and recovers an architecture by solving the sparse coding problem in an alternating manner.
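The sparse-coding machinery behind ISTA-NAS can be illustrated with a one-dimensional iterative soft-thresholding (ISTA) solver (a textbook sketch of the underlying optimization, not the paper's actual search procedure):

```python
def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista_1d(x, a, lam, step, iters=200):
    # ISTA for the 1-D sparse coding problem
    #   min_z 0.5 * (a*z - x)**2 + lam * |z|
    # alternating a gradient step on the quadratic term with
    # soft-thresholding; ISTA-NAS solves a high-dimensional analogue
    # to recover a sparse architecture from the compressed space.
    z = 0.0
    for _ in range(iters):
        grad = a * (a * z - x)
        z = soft_threshold(z - step * grad, step * lam)
    return z

z = ista_1d(x=4.0, a=1.0, lam=1.0, step=0.5)
# converges to the lasso solution max(x - lam, 0) = 3.0
```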