• Corpus ID: 52920970

TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, Roy H. Campbell
arXiv: Distributed, Parallel, and Cluster Computing

State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. The iteration time in these communication-heavy systems depends on the computation time, the communication time, and the extent to which computation and communication overlap. In this work, we identify a shortcoming in systems with graph representation for computation, such as TensorFlow and PyTorch, that results in high variance in iteration time: random order… 
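
The dependence of iteration time on overlap can be illustrated with a toy model. The sketch below is not the paper's algorithm; it assumes a single serialized link, a forward pass in which layer i needs its own parameters and the result of layer i-1, and made-up transfer and compute costs. The function name `iteration_time` and all numbers are hypothetical.

```python
from itertools import permutations

def iteration_time(transfer_order, transfer_time, compute_time):
    """Finish time of a forward pass when parameters arrive one at a time.

    transfer_order: layer indices in the order their parameters are received.
    transfer_time[i], compute_time[i]: hypothetical costs for layer i.
    Layer i starts once its parameters have arrived and layer i-1 is done.
    """
    arrival = {}
    clock = 0.0
    for layer in transfer_order:
        clock += transfer_time[layer]  # single link: transfers are serialized
        arrival[layer] = clock

    done = 0.0
    for layer in range(len(compute_time)):
        start = max(done, arrival[layer])  # wait for data and for the previous layer
        done = start + compute_time[layer]
    return done

# Made-up costs for a three-layer model.
transfer_time = [1.0, 2.0, 1.0]
compute_time = [2.0, 1.0, 2.0]

# Evaluate every possible transfer order.
times = {order: iteration_time(order, transfer_time, compute_time)
         for order in permutations(range(3))}
best = min(times, key=times.get)

for order, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(order, t)
```

In this toy setting the same model and the same link give different finish times depending only on the order in which parameters are received, which is the kind of variance a communication scheduler tries to remove.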

Throughput Prediction of Asynchronous SGD in TensorFlow

This paper presents a solution to predicting training throughput from profiling traces collected from a single-node configuration, able to model the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node.

Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training

A recursive model, OSF (Scaling Factor considering Overlap), is proposed for estimating the scaling performance of DDL training of neural network models given the settings of the DDL system, and the proposed adaptive tensor fusion improves the scaling performance by 32.2% to 150% compared to a constant tensor fusion buffer size.

Dissecting the Communication Latency in Distributed Deep Sparse Learning

This paper measures Alibaba's DDL system and reveals the major contributors to the latency, including concurrent write/read operations of different connections and network connection management.

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

A comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations, is provided, which helps readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimizations.

Preemptive All-reduce Scheduling for Expediting Distributed DNN Training

PACE is proposed, a communication scheduler that preemptively schedules (potentially fused) all-reduce tensors based on the DAG of DNN training, guaranteeing maximal overlapping of communication with computation and high bandwidth utilization.

Green, Yellow, Yield: End-Host Traffic Scheduling for Distributed Deep Learning with TensorLights

  • X. Huang, Ang Chen, T. Ng
  • Computer Science
  • 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019
TensorLights is proposed, which introduces traffic prioritization at host NICs to manage traffic contention among PSes and effectively mitigates stragglers, improves the average completion time of DL applications by up to 31%, and increases resource utilization.

Communication Optimization Strategies for Distributed Deep Learning: A Survey

A comprehensive survey of communication strategies from both algorithm and computer network perspectives is given, including how to reduce the number of communication rounds and the bits transmitted per round, and how to overlap computation and communication.

Mercury: A Simple Transport Layer Scheduler to Accelerate Distributed DNN Training

Mercury is a simple transport layer scheduler that does not partition the tensors, but moves the priority scheduling to the transport layer at the packet granularity, which achieves the near-optimal overlapping between communication and computation.

A generic communication scheduler for distributed DNN training acceleration

This work introduces a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines, and introduces a Bayesian Optimization approach to auto-tune tensor partition size and other parameters for different training models under various networking conditions.



Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication, and is shown to be applicable to different DL frameworks by plugging it into Caffe and TensorFlow.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
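
As a rough illustration of what ternarizing a gradient means, the sketch below maps each gradient entry to one of three values, -s, 0, or +s, where s is the largest magnitude in the vector, keeping an entry with probability proportional to its magnitude. It is a minimal stdlib sketch for a single layer's gradient; the function name `ternarize` is hypothetical, and the clipping step mentioned above is omitted.

```python
import random

def ternarize(grad, rng):
    """Stochastically map each entry of grad to -scale, 0, or +scale.

    scale is the largest absolute value in grad. Entry g is kept with
    probability |g| / scale, so the expected output equals the input.
    """
    scale = max(abs(g) for g in grad)
    if scale == 0.0:
        return [0.0] * len(grad)
    out = []
    for g in grad:
        keep = rng.random() < abs(g) / scale  # Bernoulli(|g| / scale)
        sign = 1.0 if g > 0 else -1.0
        out.append(sign * scale if keep else 0.0)
    return out

rng = random.Random(0)
layer_grad = [0.5, -2.0, 0.0, 1.0]  # made-up gradient for one layer
print(ternarize(layer_grad, rng))
```

Applying the function separately to each layer's gradient gives every layer its own scale, which is the layer-wise variant in spirit.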

Horovod: fast and easy distributed deep learning in TensorFlow

Horovod is an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow.
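
The ring reduction idea can be sketched as a single-process simulation. This is not Horovod's API; it is a generic ring all-reduce under the simplifying assumption that each of the n workers holds a vector of exactly n entries, one per chunk. The function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce (sum) over n workers holding n-entry vectors.

    Phase 1 (reduce-scatter): after n-1 steps, worker r holds the full sum
    of chunk (r + 1) % n.
    Phase 2 (all-gather): the finished chunks travel around the ring until
    every worker holds every sum.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]
    assert all(len(v) == n for v in data), "sketch assumes one entry per chunk"

    for step in range(n - 1):
        # Collect all messages first so the sends within a step are simultaneous.
        sends = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n])
                 for r in range(n)]
        for dst, chunk, value in sends:
            data[dst][chunk] += value  # accumulate the partial sum

    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dst, chunk, value in sends:
            data[dst][chunk] = value  # overwrite with the finished sum

    return data

workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_allreduce(workers))
```

Each worker sends and receives one chunk per step and talks only to its ring neighbour, so the per-worker traffic stays roughly constant as workers are added.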

GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server

GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.

On Scale-out Deep Learning Training for Cloud and HPC

The philosophy, design, and implementation of Intel Machine Learning Scalability Library (MLSL) are described and proof-points demonstrating scaling DL training on 100s to 1000s of nodes across Cloud and HPC systems are presented.

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Rethinking the Inception Architecture for Computer Vision

This work explores ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

Improving the speed of neural networks on CPUs

This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
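
The quantization half of such a scheme can be sketched as stochastic rounding of each normalized gradient entry to one of a fixed number of levels. This is a minimal stdlib sketch that omits the encoding step; the function name `qsgd_quantize` and the parameter name `levels` are hypothetical.

```python
import math
import random

def qsgd_quantize(vec, levels, rng):
    """Stochastically round each entry of vec to a multiple of norm / levels.

    |x| / norm lies in [0, 1]; it is scaled to [0, levels] and rounded up
    with probability equal to its fractional part, so the expected output
    equals the input.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0] * len(vec)
    out = []
    for x in vec:
        scaled = abs(x) / norm * levels
        lower = math.floor(scaled)
        level = lower + (1 if rng.random() < scaled - lower else 0)
        out.append(math.copysign(norm * level / levels, x))
    return out

rng = random.Random(0)
print(qsgd_quantize([3.0, -4.0, 0.0], levels=4, rng=rng))
```

Only the norm, the signs, and the small integer levels would need to be transmitted, which is where the reduction in communication comes from.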