• Corpus ID: 52920970

TicTac: Accelerating Distributed Deep Learning with Communication Scheduling

@article{Hashemi2019TicTacAD,
  title={TicTac: Accelerating Distributed Deep Learning with Communication Scheduling},
  author={Sayed Hadi Hashemi and Sangeetha Abdu Jyothi and Roy H. Campbell},
  journal={arXiv: Distributed, Parallel, and Cluster Computing},
  year={2019}
}
State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. The iteration time in these communication-heavy systems depends on the computation time, the communication time, and the extent of overlap between computation and communication. In this work, we identify a shortcoming in systems with a graph representation for computation, such as TensorFlow and PyTorch, that results in high variance in iteration time --- random order…
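The claim that transfer ordering drives this variance can be illustrated with a small toy simulation (the layer names, transfer times, and compute times below are invented for illustration; this is not TicTac's actual scheduling algorithm): when parameters arrive in the order the forward pass consumes them, communication overlaps computation, while a random arrival order can stall the early layers.

import random

# Hypothetical (name, transfer_time, compute_time) per layer; the forward
# pass consumes layers front to back.
layers = [("conv1", 4.0, 1.0), ("conv2", 3.0, 2.0), ("fc1", 8.0, 3.0), ("fc2", 1.0, 0.5)]

def iteration_time(transfer_order):
    # A single link serializes transfers in the given order.
    ready, t = {}, 0.0
    for name, xfer, _ in transfer_order:
        t += xfer
        ready[name] = t          # arrival time of this layer's parameters
    # The forward pass waits for a layer's parameters before computing it.
    clock = 0.0
    for name, _, comp in layers:
        clock = max(clock, ready[name]) + comp
    return clock

print("model-order transfers :", iteration_time(layers))
shuffled = layers[:]
random.shuffle(shuffled)
print("random-order transfers:", iteration_time(shuffled))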
Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning
  • S. Shi, X. Chu, Bo Li
  • Computer Science
    IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
  • 2021
TLDR
This paper formulates an optimization problem of minimizing the training iteration time in which both tensor fusion and simultaneous communications are allowed, develops an efficient optimal scheduling solution, and implements the resulting distributed training algorithm, ASC-WFBP.
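As a rough sketch of the tensor-fusion half of this line of work (the 64 MB threshold and the tensor names/sizes below are assumptions, not the paper's scheduling solution), consecutive small gradient tensors can be greedily packed into buckets so each communication call moves one larger message instead of many tiny ones:

def fuse_tensors(tensor_sizes, buffer_bytes=64 * 1024 * 1024):
    # Greedily pack consecutive gradient tensors into buckets no larger than
    # buffer_bytes; each bucket is then sent/all-reduced as a single message.
    buckets, current, used = [], [], 0
    for name, size in tensor_sizes:
        if current and used + size > buffer_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets

# Gradients in reverse layer order, as produced by backpropagation.
grads = [("fc2.b", 4_000), ("fc2.w", 40_000_000), ("fc1.b", 16_000), ("fc1.w", 90_000_000)]
print(fuse_tensors(grads))  # [['fc2.b', 'fc2.w', 'fc1.b'], ['fc1.w']]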
Throughput Prediction of Asynchronous SGD in TensorFlow
TLDR
This paper presents a solution for predicting training throughput from profiling traces collected on a single-node configuration; it models the interaction of multiple nodes and the scheduling of concurrent transmissions between the parameter server and each node.
Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training
TLDR
A recursive model, OSF (Scaling Factor considering Overlap), is proposed for estimating the scaling performance of DDL training of neural network models given the settings of the DDL system; the proposed adaptive tensor fusion improves scaling performance by 32.2%–150% compared to a constant tensor fusion buffer size.
Dissecting the Communication Latency in Distributed Deep Sparse Learning
TLDR
This paper measures Alibaba's DDL system and reveals the major contributors to the latency, including concurrent write/read operations across different connections and network connection management.
Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
TLDR
A comprehensive survey of communication-efficient distributed training algorithms is provided, covering both system-level and algorithmic-level optimizations; it helps readers understand which algorithms are more efficient in specific distributed environments and extrapolate potential directions for further optimization.
Geryon: Accelerating Distributed CNN Training by Network-Level Flow Scheduling
TLDR
Geryon is presented, a network-level flow scheduling scheme to accelerate distributed convolutional neural network (CNN) training; it leverages multiple flows with different priorities to transfer parameters of different urgency levels, naturally coordinating multiple parameter servers and prioritizing urgent parameter transfers across the entire network fabric.
Preemptive All-reduce Scheduling for Expediting Distributed DNN Training
TLDR
PACE is proposed, a communication scheduler that preemptively schedules (potentially fused) all-reduce tensors based on the DAG of DNN training, guaranteeing maximal overlap of communication with computation and high bandwidth utilization.
Green, Yellow, Yield: End-Host Traffic Scheduling for Distributed Deep Learning with TensorLights
  • X. Huang, Ang Chen, T. Ng
  • Computer Science
    2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2019
TLDR
TensorLights is proposed, which introduces traffic prioritization at host NICs to manage traffic contention among parameter servers (PSes); it effectively mitigates stragglers, improves the average completion time of DL applications by up to 31%, and increases resource utilization.
Communication Optimization Strategies for Distributed Deep Learning: A Survey
TLDR
A comprehensive survey of communication strategies from both the algorithm and the computer network perspectives is given, including how to reduce the number of communication rounds and the bits transmitted per round, and it sheds light on how to overlap computation and communication.
ByteComp: Revisiting Gradient Compression in Distributed Training
Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy
...

References

SHOWING 1-10 OF 37 REFERENCES
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
TLDR
This work mathematically proves the convergence of TernGrad under the assumption of bounded gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
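A minimal NumPy sketch of the ternarization idea follows (the layer-wise scaling and gradient clipping from the paper are omitted, and the example gradient is made up): each element is replaced by 0 or ±max|g| with a probability chosen so the expectation equals the original gradient.

import numpy as np

def ternarize(grad, rng=np.random.default_rng()):
    # Keep only a scale s = max|g| plus a random sign pattern in {-1, 0, +1};
    # E[ternarize(g)] == g, so the estimator is unbiased.
    s = np.max(np.abs(grad))
    if s == 0:
        return np.zeros_like(grad)
    prob = np.abs(grad) / s                 # per-element keep probability
    mask = rng.random(grad.shape) < prob    # Bernoulli(|g|/s)
    return s * np.sign(grad) * mask

g = np.array([0.3, -0.1, 0.05, -0.4])
print(ternarize(g))  # elements are in {-0.4, 0.0, +0.4}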
Horovod: fast and easy distributed deep learning in TensorFlow
TLDR
Horovod is an open-source library that addresses both obstacles to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow.
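The "few lines of modification" correspond roughly to the Keras sketch below (the model, learning rate, and data are placeholders; the script is meant to be launched with horovodrun so each process drives one GPU and gradients are averaged by ring all-reduce):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
# Pin each process to a single GPU, the usual Horovod setup.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])
# Wrap the optimizer so gradients are averaged across workers each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

# Start all workers from identical weights.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(x_train, y_train, callbacks=callbacks)  # run via `horovodrun -np N python train.py`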
TensorFlow: A system for large-scale machine learning
TLDR
The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
TLDR
GeePS enables a state-of-the-art single-node GPU implementation to scale well, e.g., to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
On Scale-out Deep Learning Training for Cloud and HPC
TLDR
The philosophy, design, and implementation of the Intel Machine Learning Scalability Library (MLSL) are described, and proof points demonstrating the scaling of DL training to hundreds and thousands of nodes across cloud and HPC systems are presented.
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters
TLDR
FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.
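A toy pairwise (log-depth) reduction shows the structural difference from a parameter server, which must receive every worker's gradient directly (the worker gradients here are made-up lists):

def tree_reduce(grads):
    # Each round pairs up workers and sums their gradients, halving the number
    # of senders, so the reduction finishes in O(log N) rounds.
    while len(grads) > 1:
        nxt = []
        for i in range(0, len(grads) - 1, 2):
            nxt.append([a + b for a, b in zip(grads[i], grads[i + 1])])
        if len(grads) % 2:
            nxt.append(grads[-1])
        grads = nxt
    return grads[0]

print(tree_reduce([[1, 2], [3, 4], [5, 6], [7, 8]]))  # [16, 20]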
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
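The normalization itself is compact; a minimal NumPy version is shown below (gamma and beta are the learnable scale and shift, fixed to constants here, and the batch is made up):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch dimension, then rescale and shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 12.0]])
print(batch_norm(x))  # each column now has roughly zero mean and unit variance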
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks so as to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
Improving the speed of neural networks on CPUs
TLDR
This paper uses speech recognition as an example task and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline, at no cost in accuracy.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR
Quantized SGD (QSGD) is proposed, a family of compression schemes for gradient updates that provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
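A minimal sketch of the stochastic quantization at the core of QSGD follows (the number of levels s is chosen arbitrarily, the example gradient is made up, and the paper's Elias coding of the result is omitted): each coordinate is stochastically rounded to one of s+1 magnitude levels relative to the vector norm, keeping the sign and the norm so the estimate stays unbiased.

import numpy as np

def qsgd_quantize(v, s=4, rng=np.random.default_rng()):
    # Stochastically round s*|v_i|/||v||_2 down or up to an integer level;
    # transmit (norm, signs, integer levels) instead of full-precision floats.
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    scaled = s * np.abs(v) / norm           # in [0, s]
    lower = np.floor(scaled)
    prob = scaled - lower                   # probability of rounding up
    levels = lower + (rng.random(v.shape) < prob)
    return norm * np.sign(v) * levels / s

g = np.array([0.3, -0.1, 0.05, -0.4])
print(qsgd_quantize(g))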
...