• Corpus ID: 231802364

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

Shang Wang, Peiming Yang, Yuxuan Zheng, X. Li, Gennady Pekhimenko
Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute for more insight into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while…

Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster

A detailed workload characterization of CloudBrain-I, an R&D cluster at the research institute Peng Cheng Laboratory, finds a severe problem, resource underutilization, which is especially important in R&D clusters but has not been characterised by existing works.

A Study on the Intersection of GPU Utilization and CNN Inference

This study makes the case that there is room to improve the inference-time GPU utilization of CNNs and that knowledge of GPU utilization has the potential to benefit even applications that do not target utilization itself.

Deep Learning Training on Multi-Instance GPUs

The results demonstrate that employing MIG can significantly improve GPU utilization when the workload is too small to utilize the whole GPU in isolation, and show that training models in parallel on separate MIG partitions does not exhibit interference, underlining the value of functionality like MIG on modern GPUs.



Accelerating Deep Learning Workloads Through Efficient Multi-Model Execution

HiveMind, a system designed specifically to optimize multi-model deep learning workloads, is proposed, which can accelerate simple hyperparameter tuning and multi-model inference workloads by up to 10× on NVIDIA P100 and V100 GPUs compared to sequential model execution.

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.

Benchmarking and Analyzing Deep Neural Network Training

This work proposes a new benchmark suite for DNN training, called TBD, and presents a new toolchain for performance analysis for these models that combines the targeted usage of existing performance analysis tools, careful selection of performance metrics, and methodologies to analyze the results.

Understanding and optimizing packed neural network training for hyper-parameter tuning

This paper proposes a primitive for jointly training multiple neural network models on a single GPU, called pack, and presents a comprehensive empirical study of pack and end-to-end experiments that suggest significant improvements for hyperparameter tuning.

Training Deep Nets with Sublinear Memory Cost

This work designs an algorithm that costs O(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, and shows that it is possible to trade computation for memory, giving a more memory-efficient training algorithm at a small extra computation cost.

Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training

It is shown that Daydream is able to model most mainstream DNN optimization techniques, and accurately predict the efficacy of optimizations that will result in significant performance improvements.

DAWNBench: An End-to-End Deep Learning Benchmark and Competition

DAWNBench is introduced, a benchmark and competition focused on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference with that accuracy, and will provide a useful, reproducible means of evaluating the many tradeoffs in deep learning systems.

Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks

Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is challenging. Existing DNN frameworks and compilers often treat the DNN operators in a data flow graph (DFG) as

Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training

Echo is a new compiler-based optimization scheme that addresses the first challenge with a practical mechanism that estimates the memory benefits of recomputation over the entire computation graph, and the second challenge by non-conservatively estimating the recomputation runtime overhead leveraging layer specifics.

Why does no one use advanced hyperparameter tuning

  • https://determined.ai/blog/why-does-no-one-use-advanced-hp-tuning/
  • 2020