Corpus ID: 236428967

Go Wider Instead of Deeper

@article{Xue2021GoWI,
  title={Go Wider Instead of Deeper},
  author={Fuzhao Xue and Ziji Shi and Futao Wei and Yuxuan Lou and Yong Liu and Yang You},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.11817}
}
The transformer has recently achieved impressive results on various tasks. To further improve the effectiveness and efficiency of the transformer, there are two trains of thought among existing works: (1) going wider by scaling to more trainable parameters; (2) going shallower by parameter sharing or model compressing along with the depth. However, larger models usually do not scale well when fewer tokens are available to train, and advanced parallelisms are required when the model is extremely… 
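The two directions contrasted in the abstract have very different parameter footprints. The arithmetic sketch below (Python; the dimensions, layer count, and expert count are illustrative placeholders, not the paper's configuration) compares a vanilla stack, a weight-shared "shallower" stack, and a weight-shared stack widened with mixture-of-experts feed-forward blocks.

```python
# Illustrative parameter accounting for the two scaling directions the abstract
# contrasts; all dimensions and counts are made up, not the paper's settings.
d_model, d_ff, n_layers, n_experts = 768, 3072, 12, 4

ffn_params = 2 * d_model * d_ff          # one feed-forward block (two projections)

deeper    = n_layers * ffn_params        # vanilla stack: every layer has its own FFN
shallower = 1 * ffn_params               # "going shallower": one FFN shared across all layers
wider     = n_experts * ffn_params       # "going wider": the shared FFN becomes n_experts
                                         # experts, only a few activated per token (MoE)

print(f"per-layer FFN:            {ffn_params / 1e6:.1f}M")
print(f"deeper ({n_layers} layers):       {deeper / 1e6:.1f}M")
print(f"shallower (shared):       {shallower / 1e6:.1f}M")
print(f"wider (shared + MoE):     {wider / 1e6:.1f}M")
```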

Citations

Cross-token Modeling with Conditional Computation
TLDR
This work proposes Sparse-MLP, an all-MLP model which applies sparsely activated MLPs to cross-token modeling and improves the model's computational efficiency by proposing an importance-score routing strategy for MoE and redesigning the image representation shape.
Are Transformers More Robust Than CNNs?
TLDR
This paper challenges the previous belief that Transformers outshine CNNs when measuring adversarial robustness, and finds that CNNs can easily be as robust as Transformers in defending against adversarial attacks, if they properly adopt Transformers' training recipes.
Sparse-MLP: A Fully-MLP Architecture with Conditional Computation
TLDR
Sparse-MLP is proposed, scaling the recent MLP-Mixer model with sparse MoE layers to achieve a more computation-efficient architecture; it can outperform dense MLP models with comparable parameters and less computational cost on several downstream image classification tasks.
Balancing Expert Utilization in Mixture-of-Experts Layers Embedded in CNNs
TLDR
This work addresses the problem of unbalanced expert utilization in sparsely-gated Mixture-of-Experts (MoE) layers embedded directly into convolutional neural networks, and presents both soft and hard constraint-based approaches.
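The "soft constraint" family mentioned in the TLDR is commonly realized as an auxiliary load-balancing loss added to the task loss. The sketch below shows one well-known formulation (in the style of Switch Transformers), not necessarily the exact loss of this paper; the routing probabilities are randomly generated for illustration.

```python
import numpy as np

def load_balance_loss(router_probs, expert_index, n_experts):
    """Auxiliary loss encouraging uniform expert utilization.

    router_probs: (tokens, n_experts) softmax outputs of the gate.
    expert_index: (tokens,) hard expert assignment per token.
    Switch-Transformer-style sketch of a 'soft constraint', not the cited
    paper's exact loss.
    """
    # Fraction of tokens dispatched to each expert (hard counts).
    dispatch_frac = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    # Mean router probability assigned to each expert (soft counts).
    prob_frac = router_probs.mean(axis=0)
    # The dot product is minimized when both distributions are uniform.
    return n_experts * np.dot(dispatch_frac, prob_frac)

probs = np.random.dirichlet(np.ones(4), size=128)   # fake routing probabilities
idx = probs.argmax(axis=1)
print(load_balance_loss(probs, idx, n_experts=4))
```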
Concurrent Adversarial Learning for Large Batch Training
TLDR
Computing adversarial examples with stale weights decouples the two sequential gradient computations in adversarial training, so both can run fully in parallel at each step, giving the same iteration throughput as the original SGD or Adam optimizers.
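A minimal illustration of the stale-weight idea, using a toy linear model with analytic gradients so no autodiff framework is needed; the model, loss, and step sizes are made up and only show how the perturbation step can be detached from the current-step weights.

```python
import numpy as np

# Toy linear model with squared loss, so both gradients are analytic.
# Everything here is illustrative, not the paper's setup.
def grad_w(w, x, y):  # d loss / d weights
    return 2.0 * (x @ w - y) * x

def grad_x(w, x, y):  # d loss / d input (used to build the perturbation)
    return 2.0 * (x @ w - y) * w

rng = np.random.default_rng(0)
w = rng.normal(size=8)
w_stale = w.copy()
lr, eps = 0.05, 0.1

for step in range(100):
    x, y = rng.normal(size=8), rng.normal()
    # The perturbation uses STALE weights, so it does not depend on the
    # current-step backward pass; the two gradient computations could
    # therefore run concurrently, which is the point of the TLDR above.
    x_adv = x + eps * np.sign(grad_x(w_stale, x, y))
    w_stale = w.copy()                 # snapshot before the update
    w -= lr * grad_w(w, x_adv, y)      # main update on the adversarial input
```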
One Student Knows All Experts Know: From Sparse to Dense
TLDR
This work proposes a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE, and proposes Singular Value Decomposition Knowledge Gathering (SVD-KG) to gather key knowledge from different pretrained experts.
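As a rough illustration of gathering knowledge from pretrained experts with truncated SVD (the paper's actual SVD-KG procedure is more involved), the sketch below keeps only the dominant singular components of each expert's weight matrix and averages them into one dense initialization; shapes and rank are arbitrary.

```python
import numpy as np

def gather_low_rank(expert_weights, rank):
    """Average the top-`rank` singular components of each expert's weights.

    Purely illustrative: only shows the core idea of keeping dominant
    singular directions, not the paper's full SVD-KG procedure.
    """
    gathered = np.zeros_like(expert_weights[0])
    for W in expert_weights:
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        gathered += (U[:, :rank] * S[:rank]) @ Vt[:rank]   # rank-r reconstruction
    return gathered / len(expert_weights)

experts = [np.random.randn(64, 256) for _ in range(4)]     # fake pretrained experts
dense_init = gather_low_rank(experts, rank=16)
print(dense_init.shape)
```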
Are Vision Transformers Robust to Spurious Correlations?
TLDR
This study reveals that, when pre-trained on a sufficiently large dataset, ViT models are more robust to spurious correlations than CNNs, and examines the role of the self-attention mechanism in providing robustness under spuriously correlated environments.
Deeper vs Wider: A Revisit of Transformer Configuration
Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted.

References

Showing 1-10 of 37 references
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
TLDR
This work simplifies the MoE routing algorithm and designs intuitive improved models with reduced communication and computational costs, advances the current scale of language models by pre-training models of up to a trillion parameters on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.
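A minimal sketch of the top-1 ("switch") routing idea the TLDR refers to: each token is dispatched to a single expert chosen by a learned gate. Capacity limits, the load-balancing loss, and distributed dispatch are omitted; all names and shapes here are illustrative.

```python
import numpy as np

def switch_route(tokens, gate_w, experts):
    """Top-1 routing: each token goes to exactly one expert.

    tokens: (n, d) token representations; gate_w: (d, n_experts) router weights;
    experts: list of callables, one feed-forward expert each. Illustrative only.
    """
    logits = tokens @ gate_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                    # one expert index per token
    out = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        mask = choice == e
        # Scale by the gate probability; in a trainable implementation this is
        # what keeps the router differentiable.
        out[mask] = expert(tokens[mask]) * probs[mask, e:e + 1]
    return out

d, n_exp = 16, 4
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(n_exp)]
print(switch_route(np.random.randn(8, d), np.random.randn(d, n_exp), experts).shape)
```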
Scaling Vision Transformers
TLDR
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well on few-shot learning.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
TLDR
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
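A compact sketch of noisy top-k gating in the spirit of this layer: per-token gate logits are perturbed with learned noise, only the k largest survive, and the result is renormalized so most experts receive exactly zero weight. Simplified relative to the paper's formulation.

```python
import numpy as np

def noisy_top_k_gate(x, w_gate, w_noise, k=2):
    """Noisy top-k gating sketch: sparse per-token mixture weights over experts."""
    clean = x @ w_gate
    noisy = clean + np.random.randn(*clean.shape) * np.log1p(np.exp(x @ w_noise))
    top_k = np.argsort(noisy, axis=-1)[:, -k:]        # indices of the k best experts
    gates = np.full_like(noisy, -np.inf)              # everything else gets zero weight
    np.put_along_axis(gates, top_k,
                      np.take_along_axis(noisy, top_k, axis=-1), axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    return gates / gates.sum(axis=-1, keepdims=True)  # rows sum to 1, sparse off top-k

d, n_exp = 16, 8
print(noisy_top_k_gate(np.random.randn(4, d),
                       np.random.randn(d, n_exp),
                       np.random.randn(d, n_exp)))
```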
Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
TLDR
The LAMB optimizer is proposed, which helps to scale the batch size to 65536 without losing accuracy, and is a general optimizer that works for both small and large batch sizes and does not need hyper-parameter tuning besides the learning rate.
Universal Transformers
TLDR
The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
TLDR
The empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning, and the optimizer enables use of very large batch sizes of 32868 without any degradation of performance.
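The core of LAMB is a layer-wise trust ratio that rescales an Adam-style update by ||w|| / ||update||, which is what lets the batch size grow without retuning. A minimal single-tensor sketch (bias correction omitted; hyperparameter values are placeholders, not the paper's settings):

```python
import numpy as np

def lamb_update(w, g, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-6, wd=0.01):
    """One LAMB-style step for a single weight tensor (bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    update = m / (np.sqrt(v) + eps) + wd * w                   # Adam direction + weight decay
    trust = np.linalg.norm(w) / (np.linalg.norm(update) + eps)  # layer-wise trust ratio
    return w - lr * trust * update, m, v

w = np.random.randn(128, 64)
m, v = np.zeros_like(w), np.zeros_like(w)
w, m, v = lamb_update(w, np.random.randn(*w.shape), m, v, lr=1e-3)
```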
Exploring Sparse Expert Models and Beyond
TLDR
This work investigates several key factors in sparse expert models and proposes a simple method called expert prototyping that improves model quality while maintaining constant computational cost; further exploration on extremely large-scale models shows the method is even more effective when training larger models.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
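Cross-layer parameter sharing, the "going shallower" direction discussed in the abstract above, amounts to applying one block repeatedly. A stripped-down sketch (feed-forward sublayer only, no attention or layer norm; sizes are arbitrary):

```python
import numpy as np

def shared_layer_stack(x, W1, b1, W2, b2, n_layers=12):
    """Apply ONE feed-forward block n_layers times (cross-layer weight sharing)."""
    for _ in range(n_layers):
        h = np.maximum(0.0, x @ W1 + b1)   # shared first projection + ReLU
        x = x + h @ W2 + b2                # shared second projection + residual
    return x

d, d_ff = 64, 256
x = np.random.randn(8, d)
W1, b1 = np.random.randn(d, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d) * 0.02, np.zeros(d)
print(shared_layer_stack(x, W1, b1, W2, b2).shape)
```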
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
TLDR
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.