DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation

Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo Qiao, Zhiquan Lai, Dongsheng Li
Training activations of deep neural networks occupy a large share of GPU memory, especially in large-scale networks, and the limited GPU memory resource hampers the further development of deep neural networks. Optimal utilization of GPU memory is therefore in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. Because the domain is still emerging, several dilemmas remain: 1) The efficiency of recomputation is limited…


SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
This work presents SuperNeurons, a dynamic GPU memory scheduling runtime that enables network training far beyond the GPU DRAM capacity. It can train ResNet2500, which has 10^4 basic network layers, on a 12GB K40c, and it dynamically allocates memory for convolution workspaces to achieve high performance.
Capuchin: Tensor-based GPU Memory Management for Deep Learning
This work proposes Capuchin, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation, and that makes memory management decisions based on dynamic tensor access patterns tracked at runtime.
SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
This work proposes SwapAdvisor, which, given a dataflow graph, performs joint optimization along three dimensions: operator scheduling, memory allocation, and swap decisions. It can train models up to 12 times the GPU memory limit while achieving 53-99% of the throughput of a hypothetical baseline with infinite GPU memory.
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks
This work introduces a high-performance virtualization strategy based on a "compressing DMA engine" (cDMA) that drastically reduces the size of the data structures targeted for CPU-side allocation.
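The summary above does not say how the compression works, so the following is only a minimal sketch of one way activation sparsity can shrink data before it is moved to CPU memory. It assumes a simple bitmask-plus-nonzeros encoding; the function names `zvc_compress` and `zvc_decompress` are hypothetical and are not taken from the paper.

```python
def zvc_compress(values):
    """Split a list of activations into a presence mask and its nonzero values.

    Assumption: zeros (e.g. produced by ReLU) are common, so storing one
    boolean per element plus only the nonzero values is smaller than the
    dense list.
    """
    mask = [v != 0 for v in values]
    nonzeros = [v for v in values if v != 0]
    return mask, nonzeros


def zvc_decompress(mask, nonzeros):
    """Rebuild the dense activation list from the mask and nonzero values."""
    remaining = iter(nonzeros)
    return [next(remaining) if present else 0.0 for present in mask]


activations = [0.0, 1.5, 0.0, 0.0, 2.0]
mask, nonzeros = zvc_compress(activations)
restored = zvc_decompress(mask, nonzeros)
```

In this toy case only two of the five values need to be transferred alongside the mask, and the round trip is lossless.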
vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction
Training Deep Nets with Sublinear Memory Cost
This work designs an algorithm that costs O(√n) memory to train an n-layer network, at the computational cost of only one extra forward pass per mini-batch, and shows that computation can be traded for memory to obtain a more memory-efficient training algorithm with little extra computation cost.
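The O(√n) figure can be illustrated with a simplified cost model, sketched here under stated assumptions rather than taken from the paper: the network is a plain chain of n layers with equally sized activations, one checkpoint is kept per segment, and one segment's activations are recomputed at a time during the backward pass. The function names are hypothetical.

```python
import math


def peak_activation_slots(n_layers, segment_len):
    """Activations held at peak under the simplified model.

    One checkpoint per segment stays resident, plus the activations of the
    single segment currently being recomputed for the backward pass.
    """
    n_segments = math.ceil(n_layers / segment_len)
    return n_segments + segment_len


def best_segment_len(n_layers):
    """Brute-force the segment length that minimizes peak activation slots."""
    return min(
        range(1, n_layers + 1),
        key=lambda k: peak_activation_slots(n_layers, k),
    )


n = 100
k = best_segment_len(n)
peak = peak_activation_slots(n, k)
```

For n = 100 the best segment length is 10, giving a peak of 20 slots instead of the 100 needed when every activation is stored, which matches the 2√n shape of the bound.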
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
This work presents ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. It achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth.
Efficient Combination of Rematerialization and Offloading for Training DNNs
The experiments show that offloading can remove one third of the overhead of rematerialization, and that together the two techniques can reduce the memory used for activations by a factor of 4 to 6, with an overhead below 20%.
In-place Activated BatchNorm for Memory-Optimized Training of DNNs
This work presents In-Place Activated Batch Normalization (INPLACE-ABN), a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks.
EIE: Efficient Inference Engine on Compressed Deep Neural Network
Song Han, Xingyu Liu, W. Dally. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.
This work presents EIE, an energy-efficient inference engine that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. It is 189x and 13x faster than CPU and GPU implementations, respectively, of the same DNN without compression.