Scaling Deep Learning on GPU and Knights Landing Clusters

  • Yang You, Aydın Buluç, James Demmel
  • Published 2017
  • Computer Science
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Training neural networks has become a major bottleneck. For example, training on the ImageNet dataset with a single Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators; however, these accelerators have limited on-chip memory compared with CPUs. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. On the algorithm side, we focus on Elastic Averaging SGD (EASGD) to…
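The EASGD method named in the abstract couples each worker to a shared center variable with an elastic force. A minimal single-process sketch of one synchronous round is below; the function name, learning rate, and coupling constant are illustrative choices, not taken from the paper:

```python
import numpy as np

def easgd_step(workers, center, grads, lr=0.01, rho=0.9):
    """One synchronous round of Elastic Averaging SGD (sketch).

    Each worker takes a local gradient step plus an elastic pull toward
    the shared center variable; the center in turn moves toward the
    average of the workers. Hyperparameter values here are arbitrary.
    """
    alpha = lr * rho  # elastic coupling strength
    new_workers = []
    diff_sum = np.zeros_like(center)
    for x, g in zip(workers, grads):
        diff = x - center
        new_workers.append(x - lr * g - alpha * diff)  # local step + pull
        diff_sum += diff
    new_center = center + alpha * diff_sum  # center drifts toward workers
    return new_workers, new_center
```

The elastic term lets workers explore away from the center between synchronizations, which is what makes EASGD tolerant of infrequent communication on clusters.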
On Linear Learning with Manycore Processors
This paper proposes a novel approach to parallelism called Heterogeneous Tasks on Homogeneous Cores (HTHC), which divides the problem into multiple fundamentally different tasks that are themselves parallelized.
Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms
A detailed performance analysis of distributed TensorFlow using Horovod shows almost linear throughput (images/sec) scalability up to 256 GPUs, with distributed training implemented for AlexNet, GoogleNet, and ResNet50.
Evaluation of On-Node GPU Interconnects for Training Deep Neural Networks
This thesis evaluates the performance of different on-node GPU interconnects, PCIe and NVLink, for basic operations involved in training deep neural networks.
A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers
This paper evaluates well-known DL models with large-scale datasets using the popular TensorFlow framework, providing a thorough evaluation of scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization.
A survey of techniques for optimizing deep learning on GPUs
A survey of architecture- and system-level techniques for optimizing DL applications on GPUs, covering both inference and training, and both single-GPU and distributed multi-GPU systems.
Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters
Two hierarchical distributed-memory multi-leader AllReduce algorithms optimized for GPU-accelerated clusters are proposed, in which the GPUs inside a computing node perform an intra-node communication phase to gather and store locally reduced values at designated GPUs (known as node leaders).
Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads
Two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab, are exploited; they can cut down the execution time of an allreduce microbenchmark that uses the logical ring algorithm (lr).
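The two entries above describe the same hierarchical pattern: reduce within a node to a leader GPU, allreduce across node leaders, then broadcast back inside each node. A single-process simulation of the three phases, with all names illustrative (this is not the lr_lr/lr_rab MPI implementation):

```python
import numpy as np

def hierarchical_allreduce(node_buffers):
    """Simulate a two-level multi-leader allreduce in one process.

    node_buffers: list of nodes, each a list of per-GPU arrays.
    Phase 1: GPUs within each node reduce to a designated leader.
    Phase 2: node leaders perform an inter-node allreduce (a plain
             sum here; real implementations use a ring or Rabenseifner
             algorithm over MPI).
    Phase 3: each leader broadcasts the global result within its node.
    """
    # Phase 1: intra-node reduction to each node's leader
    leaders = [np.sum(gpus, axis=0) for gpus in node_buffers]
    # Phase 2: inter-node allreduce among leaders
    global_sum = np.sum(leaders, axis=0)
    # Phase 3: intra-node broadcast of the result to every GPU
    return [[global_sum.copy() for _ in gpus] for gpus in node_buffers]
```

The point of the hierarchy is that phase 1 and 3 use fast intra-node links (NVLink/PCIe) while only the leaders touch the slower inter-node network.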
Reducing Data Motion to Accelerate the Training of Deep Neural Networks
An algorithm that dynamically adapts the data representation format of network weights during training is proposed to reduce the cost of DNN training by decreasing data movement across heterogeneous architectures composed of several GPUs and multicore CPUs.
Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs
The trade-off between communication and computation (including backward computation and gradient sparsification) is formulated as an optimization problem, and an optimal solution to the problem is derived.
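The merged sparsification idea in the entry above concatenates several layers' gradients before top-k selection, so one communication call replaces many small ones. A minimal sketch under that reading; function names and the merge-everything strategy are illustrative simplifications (the paper derives an optimal merge schedule, not reproduced here):

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a flat gradient.
    Returns (indices, values): the sparse message actually transmitted."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def merged_sparsify(layer_grads, k):
    """Merge several layers' gradients into one buffer, then apply a
    single top-k selection over the merged buffer."""
    merged = np.concatenate([g.ravel() for g in layer_grads])
    return topk_sparsify(merged, k)
```

Merging amortizes per-message latency, but delays the first transmission until all merged layers finish backpropagation; that tension is exactly the optimization problem the paper formulates.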
An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators
This paper provides a comprehensive investigation of recent advances in efficient on-chip interconnection and design methodology for DNN accelerators, including emerging interconnect technologies (e.g., in/near-memory processing) for DNN accelerator design.