• Corpus ID: 220301566

Data Movement Is All You Need: A Case Study on Optimizing Transformers

Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
Transformers have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute… 
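To make the data-movement argument concrete, here is a back-of-the-envelope arithmetic-intensity comparison (a sketch, not from the paper; the layer sizes and fp16 storage are illustrative assumptions) showing why a transformer's large GEMMs are compute-bound while pointwise operators such as bias+activation are memory-bound:

```python
# Illustrative roofline-style estimate: FLOPs per byte of data moved.
# Hardware with ~100 FLOPs/byte of machine balance runs the GEMM at peak
# compute but the pointwise op at memory bandwidth.

def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul in fp16."""
    flops = 2 * m * n * k                                  # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

def pointwise_intensity(n_elems, flops_per_elem=8, bytes_per_elem=2):
    """FLOPs per byte for an elementwise op that reads and writes each element."""
    return (flops_per_elem * n_elems) / (2 * bytes_per_elem * n_elems)

# A BERT-large-like feed-forward GEMM: (batch*seq = 8192) x 1024 -> 4096.
print(gemm_intensity(8192, 4096, 1024))   # hundreds of FLOPs/byte: compute-bound
print(pointwise_intensity(8192 * 4096))   # ~2 FLOPs/byte: memory-bound
```

The orders-of-magnitude gap in intensity is why, after the GEMMs are tuned, the remaining runtime is dominated by operators whose cost is data movement rather than arithmetic.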


A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

The extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection.


Wavelet is presented, an efficient and generic approach that can fully utilize all the available on-device memory among GPUs involved in the same distributed training job; it achieves near-optimal on-device memory usage by adopting a simple but novel scheduling scheme called Tick-Tock, which interleaves waves of peak memory usage among the accelerators.

HammingMesh: A Network Topology for Large-Scale Deep Learning

Based on the workload analysis, HammingMesh is designed: a novel network topology that provides high bandwidth at low cost with high job-scheduling flexibility, offering two dimensions of parallelism to power future large-scale deep learning systems with extreme bandwidth requirements.

Survey: Exploiting Data Redundancy for Optimization of Deep Learning

This article surveys hundreds of recent papers on data redundancy, introduces a novel taxonomy that puts the various techniques into a single categorization framework, and offers a comprehensive description of the main methods for exploiting data redundancy to improve multiple kinds of DNNs.

Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems

This artifact provides container specifications and scripts for reproducing experiments that measure and benchmark variability of GPUs across different applications, and can be used to run different benchmarks across machine learning, molecular dynamics and graph analytics.

A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey

This extensive empirical study evaluates multiple real-time detection networks on a wide range of datasets and reports results on an extensive set of metrics, which can act as a guideline for the industrial community when selecting a network that generalizes well.

Benchmarking Data Science: 12 Ways to Lie With Statistics and Performance on Parallel Computers

Twelve fallacies frequently observed in practice when focusing on compute performance are humorously discussed, along with recommendations to mitigate their dangers.

Accelerating Scientific Workflows on HPC Platforms with In Situ Processing

  • T. Do, L. Pottier, E. Deelman
  • Computer Science
    2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
  • 2022
A new execution engine is proposed that uses Decaf to manage communications within a sub-workflow (i.e., a set of jobs), accelerating task-based scientific workflows managed by the Pegasus WMS by replacing file-based communication with faster MPI messaging.

Reducing communication in the conjugate gradient method: a case study on high-order finite elements

This work targets a notoriously communication-bound solver at the core of many high-performance applications, namely the conjugate gradient method (CG), and compares CG to two communication-reducing techniques, namely communication-avoiding and pipelined CG.
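For readers unfamiliar with why CG is communication-bound, a minimal textbook CG sketch (not the paper's optimized variants) makes the structure visible: each iteration performs a sparse matrix-vector product (neighbor/halo exchange) and two inner products, and the inner products require global all-reduces that communication-avoiding and pipelined CG reorganize or overlap:

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Standard (unpreconditioned) conjugate gradient for SPD A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r                     # inner product: a global reduction
    for _ in range(max_iter):
        Ap = A @ p                     # mat-vec: neighbor (halo) communication
        alpha = rs_old / (p @ Ap)      # inner product: a global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                 # inner product for the next iteration
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
print(np.allclose(A @ x, b))  # True
```

At scale, the two blocking reductions per iteration dominate; pipelined CG hides them behind the mat-vec, while communication-avoiding CG fuses several iterations' worth into fewer reductions.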

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

DeepSpeed Inference is presented, a comprehensive system solution for transformer model inference that reduces latency by up to 7.3× over the state of the art for latency-oriented scenarios and increases throughput by over 1.5× for throughput-oriented scenarios.

HuggingFace's Transformers: State-of-the-art Natural Language Processing

The Transformers library is an open-source library of carefully engineered, state-of-the-art Transformer architectures under a unified API, together with a curated collection of pretrained models made by and available for the community.

Mesh-TensorFlow: Deep Learning for Supercomputers

Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and is used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Attention is All you Need

A new simple network architecture is proposed, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
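The Transformer's core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of the published formula (single head, no masking; the random inputs are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

The 1/√d_k scaling keeps the logits in a range where softmax gradients do not vanish as the key dimension grows.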

Apex (A PyTorch Extension)

  • 2020


TensorRT

  • 2020. [Online]. Available: https://developer.nvidia.com/tensorrt
  • 2020

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

  • 2020

ONNX Runtime

  • 2020. [Online]. Available: https://microsoft.github.io/onnxruntime/
  • 2020

CUTLASS: CUDA templates for linear algebra subroutines

  • 2020