Corpus ID: 225067709

Stochastic Optimization with Laggard Data Pipelines

@article{Agarwal2020StochasticOW,
  title={Stochastic Optimization with Laggard Data Pipelines},
  author={Naman Agarwal and Rohan Anil and Tomer Koren and Kunal Talwar and Cyril Zhang},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.13639}
}
State-of-the-art optimization is steadily shifting towards massively parallel pipelines with extremely large batch sizes. As a consequence, CPU-bound preprocessing and disk/memory/network operations have emerged as new performance bottlenecks, as opposed to hardware-accelerated gradient computations. In this regime, a recently proposed approach is data echoing (Choi et al., 2019), which takes repeated gradient steps on the same batch while waiting for fresh data to arrive from upstream. We…
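
The core mechanism can be pictured with a minimal sketch; the names below (upstream_batches, sgd_step, num_echoes) are illustrative and not from the paper:

```python
# Minimal sketch of data echoing: reuse the most recent batch for extra
# gradient steps while the upstream pipeline has no fresh batch ready.
# All names (upstream_batches, sgd_step, num_echoes) are illustrative.

def train_with_data_echoing(params, upstream_batches, sgd_step, num_echoes=4):
    """upstream_batches: iterator over preprocessed batches (the slow stage).
    sgd_step: function (params, batch) -> params (the fast, accelerated stage).
    num_echoes: gradient steps taken per batch before requesting a new one."""
    for batch in upstream_batches:      # blocking: waits on CPU/disk/network
        for _ in range(num_echoes):     # cheap: accelerator-bound
            params = sgd_step(params, batch)
    return params
```

Each upstream batch is reused for num_echoes accelerator steps, so the accelerator stays busy while preprocessing catches up.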

Citations

Progressive Compressed Records: Taking a Byte out of Deep Learning Data
The results show that the amount of compression a dataset can tolerate depends on the training task, and PCRs enable tasks to readily access appropriate levels of compression at runtime, resulting in a 2x speedup in training time on average over baseline formats.
Large-Scale Differentially Private BERT
This work studies the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD), and shows that scaling up the batch size to millions improves the utility of the DP-SGD step for BERT and enhances its efficiency by using an increasing batch size schedule.
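
A hedged sketch of what a DP-SGD step with an increasing batch-size schedule might look like (per-example clipping plus Gaussian noise; the hyperparameters and the doubling schedule are assumptions, not values from the paper):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0, noise_mult=0.5, lr=0.1):
    # Clip each example's gradient to bound its contribution.
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    g_sum = np.sum(clipped, axis=0)
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=g_sum.shape)
    g_avg = (g_sum + noise) / len(per_example_grads)  # larger batches dilute the noise
    return params - lr * g_avg

def batch_size_schedule(step, base=1024, double_every=10_000, cap=2**21):
    # Illustrative schedule: larger batches late in training shrink the
    # relative noise per averaged gradient.
    return min(cap, base * 2 ** (step // double_every))
```
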
Acceleration via Fractal Learning Rate Schedules
This work reinterprets an iterative algorithm from the numerical analysis literature as what it calls the Chebyshev learning rate schedule for accelerating vanilla gradient descent, and shows that the problem of mitigating instability leads to a fractal ordering of step sizes.
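
For context, a minimal sketch of the classical Chebyshev step sizes for a quadratic whose Hessian spectrum lies in [mu, L]; the paper's contribution concerns the ordering of these steps, and the naive sorted order below is for illustration only:

```python
import numpy as np

def chebyshev_steps(mu, L, T):
    """Step sizes are reciprocals of the roots of the Chebyshev polynomial
    shifted to [mu, L]; their ordering (the paper's fractal permutation)
    is what controls numerical stability and is omitted here."""
    k = np.arange(1, T + 1)
    roots = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos((2 * k - 1) * np.pi / (2 * T))
    return np.sort(1.0 / roots)
```
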
Algorithmic Instabilities of Accelerated Gradient Descent
It is shown that the stability of Nesterov's accelerated gradient method in fact deteriorates exponentially fast with the number of gradient steps, which stands in sharp contrast to the bounds in the quadratic case, but also to known results for non-accelerated gradient methods, where stability typically grows linearly with the number of steps.
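
For reference, one common parameterization of the accelerated iteration being analyzed (the paper may use a different but equivalent form):

```latex
% Nesterov's accelerated gradient method; the stability analysis tracks how a
% perturbation of a single training example propagates through these iterates.
\begin{aligned}
x_{t+1} &= y_t - \eta \nabla f(y_t), \\
y_{t+1} &= x_{t+1} + \beta_t \, (x_{t+1} - x_t).
\end{aligned}
```
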

References

Showing 1-10 of 50 references
Faster Neural Network Training with Data Echoing
This paper introduces "data echoing," which reduces the total computation used by earlier pipeline stages and speeds up training whenever computation upstream from accelerators dominates the training time.
Measuring the Effects of Data Parallelism on Neural Network Training
This work experimentally characterizes the effect of increasing the batch size on training time, measured as the number of steps needed to reach a target out-of-sample error, studies how this relationship varies with the training algorithm, model, and dataset, and finds extremely large variation between workloads.
Extreme Tensoring for Low-Memory Preconditioning
This work proposes extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning.
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
This work describes the problem of parallelizing DNN training from a theoretical perspective, surveys approaches for its parallelization, and extrapolates potential directions for parallelism in deep learning.
Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch Prox
This work presents and analyzes an approach for distributed stochastic optimization that is statistically optimal and achieves near-linear speedups (up to logarithmic factors), and provides a novel analysis of the underlying minibatch-prox procedure which attains the statistically optimal rate regardless of minibatch size and smoothness, significantly improving on prior work.
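
A minimal sketch of a single minibatch-prox step, assuming an inner gradient-descent solver; the name batch_loss_grad and all hyperparameters are illustrative:

```python
import numpy as np

def minibatch_prox_step(w, batch_loss_grad, lam=1.0, inner_steps=50, inner_lr=0.05):
    """Approximately solve  min_u  f_batch(u) + (lam / 2) * ||u - w||^2."""
    u = w.copy()
    for _ in range(inner_steps):
        g = batch_loss_grad(u) + lam * (u - w)   # gradient of the proximal objective
        u -= inner_lr * g
    return u
```

In the distributed setting, each machine would compute such a step on its own minibatch and the results would be averaged.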
Image Classification at Supercomputer Scale
Three systems-related optimizations are discussed: (1) distributed batch normalization to control per-replica batch sizes, (2) input pipeline optimizations to sustain model throughput, and (3) 2-D torus all-reduce to speed up gradient summation.
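
The 2-D all-reduce idea can be sketched as a row-wise reduction followed by a column-wise reduction over the device mesh; this numpy simulation only illustrates the data flow and is not the paper's TPU implementation:

```python
import numpy as np

def all_reduce_2d(grid):
    """grid: 2-D list of equally-shaped gradient arrays, one per device."""
    rows, cols = len(grid), len(grid[0])
    # Phase 1: reduce within each row; every device in a row gets its row sum.
    row_sums = [np.sum(grid[r], axis=0) for r in range(rows)]
    grid = [[row_sums[r].copy() for _ in range(cols)] for r in range(rows)]
    # Phase 2: reduce within each column; every device now holds the global sum.
    col_sums = [np.sum([grid[r][c] for r in range(rows)], axis=0) for c in range(cols)]
    return [[col_sums[c].copy() for c in range(cols)] for r in range(rows)]
```
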
In-datacenter performance analysis of a tensor processing unit
N. Jouppi, C. Young, +73 authors, D. Yoon. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Memory-Efficient Adaptive Optimization for Large-Scale Learning
This work describes a novel, simple, and flexible adaptive optimization method with sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches, and gives convergence guarantees for the method.
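
A minimal sketch of a cover-set accumulator of this kind for a matrix-shaped parameter, keeping only O(m+n) optimizer state; details such as epsilon placement and initialization are assumptions:

```python
import numpy as np

def sm3_style_step(W, G, r, c, lr=0.1, eps=1e-8):
    """One update for a matrix parameter W with gradient G.
    r: (m,) row accumulator, c: (n,) column accumulator -- O(m+n) state."""
    # Per-entry second-moment estimate, recomputed from the small accumulators.
    nu = np.minimum(r[:, None], c[None, :]) + G ** 2
    r = nu.max(axis=1)                      # tighten the row accumulator
    c = nu.max(axis=0)                      # tighten the column accumulator
    W = W - lr * G / (np.sqrt(nu) + eps)    # Adagrad-style preconditioned step
    return W, r, c
```
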
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
This work demonstrates empirically that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow, and proposes update clipping and a gradually increasing decay rate scheme as remedies.
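
A hedged sketch of an Adafactor-style update for a matrix parameter, combining a row/column-factored second moment with update clipping and an increasing decay rate; the constants are illustrative defaults, not necessarily the paper's:

```python
import numpy as np

def adafactor_style_update(W, G, R, C, step, lr=0.01, clip_threshold=1.0, eps=1e-30):
    """One update (no momentum). R: (m,) row factor, C: (n,) column factor."""
    beta2 = 1.0 - step ** (-0.8)                        # gradually increasing decay rate
    R = beta2 * R + (1 - beta2) * ((G ** 2) + eps).sum(axis=1)
    C = beta2 * C + (1 - beta2) * ((G ** 2) + eps).sum(axis=0)
    V = np.outer(R, C) / R.sum()                        # rank-1 reconstruction of second moment
    U = G / np.sqrt(V)
    rms = np.sqrt(np.mean(U ** 2))
    U = U / max(1.0, rms / clip_threshold)              # update clipping
    return W - lr * U, R, C
```
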
Augment your batch: better training with larger batches
Batch augmentation acts as a regularizer and an accelerator, increasing both generalization and performance scaling in large-batch SGD, and empirically improves convergence for a wide variety of deep neural networks and datasets.
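
A minimal sketch of the batch-augmentation idea; the augment transform and repetition factor m are placeholders:

```python
import numpy as np

def augment_batch(x, y, augment, m=4):
    """Repeat each example m times with independent random augmentations.
    x: (B, ...) inputs, y: (B,) labels -> ((B*m, ...), (B*m,))."""
    xs = np.concatenate(
        [np.stack([augment(xi) for xi in x]) for _ in range(m)], axis=0)
    ys = np.tile(y, m)
    return xs, ys
```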