• Corpus ID: 220301566

# Data Movement Is All You Need: A Case Study on Optimizing Transformers

@article{Ivanov2021DataMI,
title={Data Movement Is All You Need: A Case Study on Optimizing Transformers},
author={Andrei Ivanov and Nikoli Dryden and Tal Ben-Nun and Shigang Li and Torsten Hoefler},
journal={ArXiv},
year={2021},
volume={abs/2007.00072}
}
• Published 30 June 2020
• Computer Science
• ArXiv
Transformers have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute…
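The abstract's appeal to Amdahl's Law can be made concrete with a short calculation. The fractions below are illustrative assumptions, not numbers from the paper; the point is only that when data movement dominates runtime, speeding up compute alone yields limited end-to-end gains:

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when only part of the runtime is accelerated.

    Amdahl's Law: S = 1 / ((1 - p) + p / s), where p is the fraction of
    runtime that benefits and s is the speedup applied to that fraction.
    """
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Hypothetical split: compute is 40% of runtime, data movement is 60%.
# Even an effectively infinite compute speedup caps end-to-end gains:
compute_only = amdahl_speedup(fraction=0.4, factor=1e9)
print(f"{compute_only:.2f}x")  # 1.67x: data movement bounds the speedup

# Merely halving data-movement time already approaches that ceiling:
halve_movement = amdahl_speedup(fraction=0.6, factor=2.0)
print(f"{halve_movement:.2f}x")  # 1.43x
```

This is why the authors argue that, with compute improving faster than memory bandwidth, optimizing data movement is where the remaining headroom lies.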
36 Citations

## Figures and Tables from this paper

### A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

• Computer Science
VISIGRAPP
• 2022
The extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection.

### WAVELET: EFFICIENT DNN TRAINING WITH TICK-TOCK SCHEDULING

• Computer Science
• 2021
Wavelet is presented, an efficient and generic approach that can fully utilize all the available on-device memory among GPUs involved in the same distributed training job, and achieves near optimal on-device memory usage by adopting a simple but novel scheduling scheme called Tick-Tock, which interleaves waves of peak memory usage among the accelerators.

### HammingMesh: A Network Topology for Large-Scale Deep Learning

• Computer Science
ArXiv
• 2022
Based on the workload analysis, HammingMesh is designed: a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility, exploiting two dimensions of parallelism to power future large-scale deep learning systems with extreme bandwidth requirements.

### Survey: Exploiting Data Redundancy for Optimization of Deep Learning

• Computer Science
ACM Computing Surveys
• 2022
This article surveys hundreds of recent papers on data redundancy, introduces a novel taxonomy to put the various techniques into a single categorization framework, and offers a comprehensive description of the main methods for exploiting data redundancy to improve multiple kinds of DNNs.

### Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems

• Computer Science
ArXiv
• 2022
This artifact provides container specifications and scripts for reproducing experiments that measure and benchmark variability of GPUs across different applications, and can be used to run different benchmarks across machine learning, molecular dynamics, and graph analytics.

### A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey

• Computer Science
• 2022
This extensive empirical study evaluates multiple real-time detection networks on a wide range of datasets, reporting results on an extensive set of metrics that can act as a guideline for the industrial community.

### Benchmarking Data Science: 12 Ways to Lie With Statistics and Performance on Parallel Computers

Twelve fallacies that are frequently observed in practice when focusing on compute performance are humorously discussed, along with recommendations to mitigate their danger.

### Accelerating Scientific Workflows on HPC Platforms with In Situ Processing

• Computer Science
2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
• 2022
A new execution engine that uses Decaf to manage communications within a sub-workflow (i.e., set of jobs) to optimize inter-job communications is proposed, to accelerate task-based scientific workflows managed by the Pegasus WMS, by replacing file communications with faster MPI messaging.

### Reducing communication in the conjugate gradient method: a case study on high-order finite elements

• Computer Science
PASC
• 2022
This work targets a notoriously communication-bound solver at the core of many high-performance applications, namely the conjugate gradient method (CG), and compares CG to two communication-reducing techniques, namely communication-avoiding and pipelined CG.

### DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

• Computer Science
ArXiv
• 2022
DeepSpeed Inference is presented, a comprehensive system solution for transformer model inference that reduces latency by up to 7.3x over the state of the art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.

## References

SHOWING 1-10 OF 147 REFERENCES

### DeepSpeed

• Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
• 2020

### HuggingFace's Transformers: State-of-the-art Natural Language Processing

• Computer Science
ArXiv
• 2019
The *Transformers* library is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.

### Mesh-TensorFlow: Deep Learning for Supercomputers

• Computer Science
NeurIPS
• 2018
Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations and used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.

### BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

• Computer Science
NAACL
• 2019
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

### Attention is All you Need

• Computer Science
NIPS
• 2017
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.


### NVIDIA TensorRT

• 2020. [Online]. Available: https://developer.nvidia.com/tensorrt
• 2020


### ONNX Runtime

• 2020. [Online]. Available: https://microsoft.github.io/onnxruntime/
• 2020
