• Corpus ID: 372467

Large Scale Distributed Deep Networks

  Jeffrey Dean, Gregory S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed…
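The Downpour SGD idea, many model replicas asynchronously pulling parameters from and pushing gradients to a shared parameter server, can be illustrated with a minimal single-process sketch in which threads stand in for worker machines. The `ParameterServer` and `replica` names and the toy least-squares objective are illustrative assumptions, not the paper's actual implementation:

```python
import threading
import numpy as np

class ParameterServer:
    """Central store for model parameters; replicas read and write it asynchronously."""
    def __init__(self, dim, lr):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()  # one coarse lock for the sketch; a real server shards parameters

    def pull(self):
        return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad

def replica(ps, X, y, steps):
    """One model replica: fetch (possibly stale) parameters, compute a gradient, push it back."""
    for t in range(steps):
        w = ps.pull()                    # no synchronization barrier: w may be stale
        i = t % len(X)
        grad = (X[i] @ w - y[i]) * X[i]  # least-squares gradient on one example
        ps.push(grad)                    # asynchronous update

# toy problem: recover w* = [1, 2] from noiseless linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = X @ np.array([1.0, 2.0])

ps = ParameterServer(dim=2, lr=0.05)
threads = [threading.Thread(target=replica, args=(ps, X, y, 400)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.round(ps.w, 2))  # converges close to [1. 2.] despite stale reads
```

Despite every replica reading stale parameters, the shared weights converge on this toy problem, which is the practical observation Downpour SGD exploits at scale.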


Partitioning Large Scale Deep Belief Networks Using Dropout

This work considers a well-known machine learning model, the deep belief network (DBN), and proposes an approach that uses computing clusters in a distributed environment to train large models, while dense matrix computations within a single machine are sped up using graphics processors (GPUs).

Reducing the training time of deep learning models using synchronous SGD and large batch size

This study presents the current state of the art in distributed training frameworks, surveying the methods and strategies used to distribute training, and shows that with the same approaches a smaller deep network can be trained for an image classification problem in a shorter time.

Performance Modeling of Distributed Deep Neural Networks

This paper analyzes CNTK, one of the most commonly used DDNNs, by first building a performance model and then evaluating the system in two settings: a small cluster with all nodes in a single rack connected to a top-of-rack switch, and at large scale on Blue Waters with arbitrary placement of nodes.

A Hitchhiker's Guide On Distributed Training of Deep Neural Networks

An Efficient Method for Training Deep Learning Networks Distributed

This paper proposes a hierarchical synchronous Stochastic Gradient Descent (SGD) strategy, which makes full use of hardware resources and greatly increases computational efficiency, and integrates the LARS algorithm into the system.


  • Computer Science
  • 2021
This work proposes SWARM Parallelism, a model-parallel training algorithm designed for swarms of poorly connected, heterogeneous, and unreliable devices, which creates temporary randomized pipelines between available nodes and rebalances them in case of failure.

Distributed learning of deep feature embeddings for visual recognition tasks

This work covers fine-tuning, where a pretrained model is used as the basis for further training, as well as the use of pretrained models for learning deep feature embeddings.

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

This paper presents a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP), which improves the state-of-the-art SSP paradigm by dynamically adapting the staleness threshold per iteration at run time.
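The SSP condition that DSSP adapts can be stated in a few lines. This sketch shows only the staleness gate; the `ssp_can_proceed` helper is a hypothetical name, and DSSP's actual per-iteration rule for choosing the threshold is not reproduced here:

```python
def ssp_can_proceed(clocks, worker, threshold):
    """Stale Synchronous Parallel gate: a worker may start its next iteration
    only if it is at most `threshold` clocks ahead of the slowest worker.
    DSSP turns `threshold` itself into a quantity chosen dynamically per iteration."""
    return clocks[worker] - min(clocks.values()) <= threshold

clocks = {"w0": 5, "w1": 3, "w2": 4}               # iteration counters of three workers
print(ssp_can_proceed(clocks, "w0", threshold=2))  # True: 5 - 3 <= 2
print(ssp_can_proceed(clocks, "w0", threshold=1))  # False: 5 - 3 > 1
```

A larger threshold lets fast workers run further ahead (more throughput, more staleness); DSSP's contribution is choosing that trade-off at run time rather than fixing it in advance.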

Performance Analysis and Comparison of Distributed Machine Learning Systems

This work develops a performance model of computation time and communication latency under three system architectures: Parameter Server, peer-to-peer, and Ring allreduce, and finds that the system architecture has a very significant effect on training performance.
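Of the three architectures compared, Ring allreduce is the most algorithmically specific; its two phases (reduce-scatter, then allgather) can be simulated serially. This is a sketch of the communication pattern only, not any particular library's implementation:

```python
import numpy as np

def ring_allreduce(node_chunks):
    """Serial simulation of ring allreduce: n nodes each hold a gradient split
    into n chunks; after reduce-scatter plus allgather, every node holds the
    elementwise sum of all gradients, in 2*(n-1) steps moving one chunk per node."""
    n = len(node_chunks)
    buf = [[np.array(c, dtype=float) for c in node] for node in node_chunks]

    # reduce-scatter: at step s, node i sends chunk (i - s) mod n to node i+1,
    # which accumulates it. Sends are snapshotted so the serial loop matches
    # the parallel semantics.
    for s in range(n - 1):
        sends = [buf[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            buf[(i + 1) % n][(i - s) % n] += sends[i]
    # now node i holds the fully reduced chunk (i + 1) mod n

    # allgather: circulate the reduced chunks so every node ends with every chunk
    for s in range(n - 1):
        sends = [buf[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            buf[(i + 1) % n][(i + 1 - s) % n] = sends[i]
    return buf

# three nodes, each with a gradient of three 2-element chunks
rng = np.random.default_rng(1)
n = 3
grads = [[rng.integers(0, 5, size=2) for _ in range(n)] for _ in range(n)]
out = ring_allreduce(grads)
expected = [sum(grads[i][c] for i in range(n)) for c in range(n)]
print(all(np.array_equal(out[i][c], expected[c]) for i in range(n) for c in range(n)))
```

Each node sends and receives the same total volume regardless of n, which is why the performance model treats Ring allreduce differently from the Parameter Server pattern, where the server's link becomes the bottleneck.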

Parallel Training of Deep Networks with Local Updates

This paper investigates how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework that parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.



Large-scale deep unsupervised learning using graphics processors

It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.

Improving the speed of neural networks on CPUs

This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large-vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline, at no cost in accuracy.

An Analysis of Single-Layer Networks in Unsupervised Feature Learning

The results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance—so critical, in fact, that when these parameters are pushed to their limits, they achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

Building high-level features using large scale unsupervised learning

Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.

Scalable stacking and learning for building deep architectures

  • Li Deng, Dong Yu, John C. Platt
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
The Deep Stacking Network (DSN) is presented, which overcomes the problem of parallelizing learning algorithms for deep architectures and provides a method of stacking simple processing modules to build deep architectures, with a convex learning problem in each module.

Distributed GraphLab: A Framework for Machine Learning in the Cloud

This paper develops graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency, and introduces fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm.

Distributed Training Strategies for the Structured Perceptron

This paper investigates distributed training strategies for the structured perceptron as a means to reduce training time when computing clusters are available, examining two strategies and providing convergence bounds for a particular mode of distributed structured perceptron training based on iterative parameter mixing (or averaging).
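Iterative parameter mixing itself is simple to sketch: train a perceptron independently on each shard, then average the shard weights and repeat. The helper names and the toy separable data below are illustrative assumptions, not the paper's structured-prediction setup:

```python
import numpy as np

def perceptron_epoch(w, X, y):
    """One perceptron pass over a shard, returning the updated weights."""
    w = w.copy()
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:  # mistake-driven update
            w += yi * xi
    return w

def iterative_parameter_mixing(X, y, n_shards=4, epochs=5):
    """Each epoch, train independently on every shard starting from the current
    mixed weights, then average the shard weights (uniform mixing)."""
    shards = list(zip(np.array_split(X, n_shards), np.array_split(y, n_shards)))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        shard_ws = [perceptron_epoch(w, Xs, ys) for Xs, ys in shards]
        w = np.mean(shard_ws, axis=0)
    return w

# linearly separable toy data with margin at least 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
m = X @ np.array([2.0, -1.0, 0.5])
X, y = X[np.abs(m) > 0.5], np.sign(m[np.abs(m) > 0.5])

w = iterative_parameter_mixing(X, y)
acc = np.mean(np.sign(X @ w) == y)
print(acc)  # high accuracy on this separable sample
```

The paper's convergence bounds apply to exactly this mixing pattern, in contrast to a single final average of independently trained models, which carries no such guarantee.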

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

This work shows, through novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! that allows processors to access shared memory with the possibility of overwriting each other's work.
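The HOGWILD! scheme can be sketched as threads performing unsynchronized read-modify-write updates on a shared vector; when examples are sparse, updates rarely collide on the same coordinate. The toy sparse least-squares setup below is an illustrative assumption, not the paper's benchmark:

```python
import threading
import numpy as np

# shared parameter vector, updated by all threads with NO locking
w = np.zeros(4)

def worker(X, y, lr, steps, rng):
    for _ in range(steps):
        i = rng.integers(len(X))
        x = X[i]
        err = x @ w - y[i]
        for j in np.nonzero(x)[0]:   # sparse update: touch only active coordinates
            w[j] -= lr * err * x[j]  # racy read-modify-write, by design

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0, 0.5])
X = rng.normal(size=(256, 4))
X[rng.random(X.shape) < 0.5] = 0.0   # make examples sparse so collisions are rare
y = X @ w_true

threads = [threading.Thread(target=worker,
                            args=(X, y, 0.05, 2000, np.random.default_rng(k)))
           for k in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.round(w, 1))  # approaches w_true = [1, -2, 3, 0.5]
```

Occasional overwritten updates slow convergence only slightly on sparse problems, which is the effect the paper's analysis quantifies.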

Distributed delayed stochastic optimization

This work exhibits n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as O(1/√(nT)) after T iterations, which is known to be optimal for a distributed system with n nodes even in the absence of delays.