Scaling Deep Learning on GPU and Knights Landing Clusters

  • Yang You, Aydın Buluç, James Demmel
  • Published 9 August 2017
  • Computer Science
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Training neural networks has become a major bottleneck: for example, training on the ImageNet dataset with a single Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators; however, these accelerators have limited on-chip memory compared with CPUs. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. On the algorithm side, we focus on Elastic Averaging SGD (EASGD) to…
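The EASGD update the abstract refers to can be sketched in a few lines. Below is a simplified, single-parameter toy; the worker count, learning rate, and elastic coefficient are illustrative choices, not the paper's settings:

```python
import random

def easgd_step(workers, center, grads, lr=0.1, rho=0.5):
    """One synchronous round of Elastic Averaging SGD (scalar toy).

    Each worker takes a gradient step plus an elastic pull toward the
    center variable; the center is pulled toward the workers in turn."""
    new_workers = [x - lr * (g + rho * (x - center))
                   for x, g in zip(workers, grads)]
    center = center + lr * rho * sum(x - center for x in workers)
    return new_workers, center

# Minimize f(x) = x^2 (gradient 2x) with 4 workers from random starts.
random.seed(0)
workers = [random.uniform(-1, 1) for _ in range(4)]
center = 0.0
for _ in range(200):
    grads = [2 * x for x in workers]
    workers, center = easgd_step(workers, center, grads)
print(round(center, 4))  # workers and center agree near the optimum 0
```

The elastic term lets workers explore away from the center without diverging from it, which is what makes EASGD tolerant of slow communication rounds.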

Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms

A detailed performance analysis of distributed TensorFlow using Horovod shows almost linear throughput (images/sec) scalability up to 256 GPUs, and distributed training is implemented for AlexNet, GoogLeNet, and ResNet50 using Horovod.

Evaluation of On-Node GPU Interconnects for Training Deep Neural Networks

This thesis evaluates the performance of different on-node GPU interconnects: PCIe and NVLink for basic operations involved in training deep neural networks.

A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers

This paper evaluates well-known DL models with large-scale datasets using the popular TensorFlow framework, and provides a thorough evaluation including scalability, accuracy, variability, storage resource, GPU-GPU/GPU-CPU data transfer, and GPU utilization.

Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters

Two hierarchical distributed-memory multi-leader AllReduce algorithms optimized for GPU-accelerated clusters are proposed, in which the GPUs inside a computing node perform an intra-node communication phase to gather the locally reduced values on designated GPUs (known as node leaders).
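The two-phase scheme described above can be illustrated with a small simulation; plain Python stands in for actual MPI/NCCL communication, and the node and GPU counts are arbitrary:

```python
def hierarchical_allreduce(node_values):
    """Two-phase allreduce sketch: node_values[n][g] is the value held by
    GPU g on node n. Returns the same shape with every GPU holding the
    global sum; only one message per node leader crosses the network."""
    # Phase 1: each node reduces its local GPUs onto a leader (GPU 0).
    leader_sums = [sum(gpus) for gpus in node_values]
    # Phase 2: allreduce among node leaders only (here, a global sum).
    global_sum = sum(leader_sums)
    # Phase 3: leaders broadcast the result back to their local GPUs.
    return [[global_sum] * len(gpus) for gpus in node_values]

# 2 nodes x 4 GPUs each.
vals = [[1, 2, 3, 4], [5, 6, 7, 8]]
out = hierarchical_allreduce(vals)
print(out[0][0])  # every GPU now holds the global sum 36
```

Restricting the inter-node phase to leaders is what reduces traffic over the (slower) network links relative to a flat allreduce across all GPUs.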

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads

Two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab, are exploited and can cut down the execution time of an Allreduce microbenchmark that uses the logical ring (lr) algorithm.

Reducing Data Motion to Accelerate the Training of Deep Neural Networks

An algorithm to dynamically adapt the data representation format of network weights during training is proposed to reduce the cost of DNNs training by decreasing the amount of data movement across heterogeneous architectures composed of several GPUs and multicore CPUs.
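The idea of adapting the weight representation during training can be sketched as follows; the precision-switch rule here is a hypothetical stand-in (the paper's actual policy is more involved), and IEEE half precision is used as the reduced format:

```python
import struct

def to_half(x):
    """Round a float to IEEE-754 half precision, simulating a reduced
    16-bit weight representation that moves less data between devices."""
    return struct.unpack('e', struct.pack('e', x))[0]

def train_step(w, grad, lr, low_precision):
    w = w - lr * grad
    return to_half(w) if low_precision else w

# Hypothetical policy: keep weights in 16 bits while updates are large,
# fall back to full precision once they become small.
w, lr = 1.0, 0.1
for _ in range(50):
    grad = 2 * w                    # gradient of f(w) = w^2
    low = abs(lr * grad) > 1e-3     # crude precision-switch rule
    w = train_step(w, grad, lr, low)
print(abs(w) < 1e-2)
```

The payoff is bandwidth: a 16-bit format halves the bytes moved per weight across the GPU/CPU boundary while the rounding error stays small relative to the update size.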

Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs

  • S. Shi, Qiang Wang, Xin Zhao
  • Computer Science
    IEEE INFOCOM 2020 - IEEE Conference on Computer Communications
  • 2020
The trade-off between communications and computations (including backward computation and gradient sparsification) is formulated as an optimization problem, and an optimal solution to the problem is derived.

An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators

This paper provides a comprehensive investigation of recent advances in efficient on-chip interconnection and design methodology for DNN accelerators, and surveys emerging interconnection technologies (e.g., in/near-memory processing) for DNN accelerator design.

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

This work studies the memory efficiency of various CNN layers and reveals the performance implication from both data layouts and memory access patterns, which shows the universal effect of the proposed optimizations on both single layers and various networks.
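The data-layout effect the authors measure comes down to memory strides. A small illustrative sketch (not the paper's code) contrasting the two common layouts: in NCHW a sweep across one channel's pixels touches consecutive addresses, while in NHWC the same sweep strides by the channel count:

```python
def idx_nchw(n, c, h, w, C, H, W):
    """Flat offset in NCHW layout: one channel's pixels are contiguous."""
    return ((n * C + c) * H + h) * W + w

def idx_nhwc(n, c, h, w, C, H, W):
    """Flat offset in NHWC layout: one pixel's channels are contiguous."""
    return ((n * H + h) * W + w) * C + c

C, H, W = 3, 4, 4
# Moving one pixel to the right within a single channel:
print(idx_nchw(0, 0, 0, 1, C, H, W) - idx_nchw(0, 0, 0, 0, C, H, W))  # 1
print(idx_nhwc(0, 0, 0, 1, C, H, W) - idx_nhwc(0, 0, 0, 0, C, H, W))  # 3
```

Which layout wins depends on which dimension the GPU threads sweep: coalesced accesses need neighboring threads to hit neighboring addresses.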

Deep learning with COTS HPC systems

This paper presents technical details and results from their own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.
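The reduction-tree idea can be sketched with a toy that sums scalars in ceil(log2 P) levels, whereas a single parameter server would serialize P transfers; this is pure Python with no actual communication:

```python
def tree_reduce(values):
    """Binary-tree reduction sketch: sums P values in ceil(log2 P) levels.
    At each level, rank i accumulates the value of rank i + stride."""
    vals = list(values)
    p = len(vals)
    stride, levels = 1, 0
    while stride < p:
        for i in range(0, p - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        levels += 1
    return vals[0], levels

total, levels = tree_reduce([1] * 16)
print(total, levels)  # 16 workers summed in 4 tree levels
```

The latency advantage is the log factor: doubling the GPU count adds one tree level instead of doubling the parameter server's serialized traffic.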

1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

This work shows empirically that in SGD training of deep neural networks, the gradients can be quantized aggressively, to just one bit per value, at no or nearly no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback); data-parallel, deterministically distributed SGD is implemented by combining this finding with AdaGrad.
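The 1-bit-with-error-feedback mechanism can be sketched as follows; the fixed scale and the toy gradient are illustrative (a real system picks scales from the data and runs on minibatch gradients):

```python
def one_bit_quantize(grad, residual, scale=0.1):
    """1-bit gradient quantization with error feedback (toy sketch).

    Each coordinate is transmitted as just its sign (times a fixed scale);
    the quantization error is added back on the next minibatch, so no
    gradient information is lost over time."""
    quantized, new_residual = [], []
    for g, r in zip(grad, residual):
        v = g + r                          # carry forward last step's error
        q = scale if v >= 0 else -scale    # only 1 bit (the sign) is sent
        quantized.append(q)
        new_residual.append(v - q)         # error feedback
    return quantized, new_residual

# Note: scale must exceed |g| here, or the residual grows without bound.
grad = [0.03, -0.08, 0.05]
residual = [0.0, 0.0, 0.0]
total = [0.0, 0.0, 0.0]
for _ in range(100):
    q, residual = one_bit_quantize(grad, residual)
    total = [t + x for t, x in zip(total, q)]
avg_sent = [t / 100 for t in total]
print([round(a, 2) for a in avg_sent])  # averages track the true gradient
```

Even though each individual message is crude, the telescoping residual guarantees the time-averaged transmitted gradient converges to the true one.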

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Deep Learning with Limited Numerical Precision

The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
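The stochastic rounding the result hinges on is easy to state: round up with probability equal to the fractional position between grid points, so the rounding is unbiased in expectation. A minimal sketch, with an illustrative 8-bit-fraction grid:

```python
import random

def stochastic_round(x, step=1.0 / 256):
    """Round x onto a fixed-point grid of spacing `step`, rounding up with
    probability equal to the fractional position between grid points,
    so that E[rounded x] == x (unbiased)."""
    lower = (x // step) * step
    frac = (x - lower) / step
    return lower + step if random.random() < frac else lower

random.seed(42)
x = 0.004                              # sits between grid points 1/256 and 2/256
samples = [stochastic_round(x) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(abs(mean - x) < 1e-4)            # no systematic bias, unlike
                                       # deterministic round-to-nearest
```

This matters for training because tiny gradient updates that round-to-nearest would always discard are still applied a proportional fraction of the time.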

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

A binary matrix multiplication GPU kernel is programmed with which it is possible to run the MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
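On a GPU such a kernel uses XNOR and popcount over packed words; the underlying arithmetic identity can be checked in plain Python (the bit packing and vector length here are illustrative):

```python
def pack(vec):
    """Pack a {-1, +1} vector into an int: bit i set means vec[i] == +1."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors: matching bits
    contribute +1 and differing bits -1, hence
    dot = n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, -1, -1]
reference = sum(x * y for x, y in zip(a, b))
fast = binary_dot(pack(a), pack(b), len(a))
print(reference == fast)  # the bitwise form matches exact arithmetic
```

One XOR plus one popcount replaces 32 or 64 multiply-adds per machine word, which is where the reported speedup comes from.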

CA-SVM: Communication-Avoiding Support Vector Machines on Distributed Systems

This study considers a series of algorithmic refinements, leading ultimately to a Communication-Avoiding SVM (CA-SVM) method that improves the isoefficiency to nearly W = Omega(P), better than even a one-dimensional block-row dense matrix-vector multiplication.

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

This work aims to show, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! that allows processors to access shared memory with the possibility of overwriting each other's work.
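The lock-free update pattern can be sketched with Python threads on a shared list; this dense scalar toy only demonstrates the racy read-modify-write (HOGWILD!'s convergence analysis additionally assumes sparse updates), and the problem, step count, and learning rate are illustrative:

```python
import threading

def hogwild_worker(w, data, lr=0.05, steps=2000):
    """Run SGD steps on the shared weight list without any locking;
    occasionally lost updates are tolerated, as in HOGWILD!."""
    for t in range(steps):
        x, y = data[t % len(data)]
        g = 2 * (w[0] * x - y) * x    # gradient of (w[0]*x - y)^2
        w[0] -= lr * g                # racy read-modify-write, by design

w = [0.0]                             # shared model: fit y = 3x
data = [(1.0, 3.0), (2.0, 6.0), (0.5, 1.5)]
threads = [threading.Thread(target=hogwild_worker, args=(w, data))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(w[0], 2))  # all threads drive the shared weight toward 3
```

Despite lost updates, every write moves the weight toward the optimum, so the threads converge without the synchronization cost of a lock.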

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.