Convolutional Neural Network Training with Distributed K-FAC

@inproceedings{Pauloski2020ConvolutionalNN,
  title={Convolutional Neural Network Training with Distributed K-FAC},
  author={J. Gregory Pauloski and Z. Zhang and Lei Huang and Weijia Xu and Ian T. Foster},
  booktitle={SC},
  year={2020}
}
Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as…
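For reference, the Kronecker-factored approximation that K-FAC applies to each layer's Fisher block can be sketched as follows (the standard formulation; the notation here is mine, not quoted from the paper):

\[
F_l \approx A_{l-1} \otimes G_l, \qquad
A_{l-1} = \mathbb{E}\left[ a_{l-1} a_{l-1}^{\top} \right], \qquad
G_l = \mathbb{E}\left[ g_l g_l^{\top} \right],
\]
\[
F_l^{-1} \operatorname{vec}\left( \nabla_{W_l} \mathcal{L} \right)
  \approx \left( A_{l-1}^{-1} \otimes G_l^{-1} \right) \operatorname{vec}\left( \nabla_{W_l} \mathcal{L} \right)
  = \operatorname{vec}\left( G_l^{-1} \, \nabla_{W_l} \mathcal{L} \, A_{l-1}^{-1} \right),
\]

where a_{l-1} collects the layer's inputs and g_l the gradients of the loss with respect to its pre-activation outputs; only the two small factors need to be estimated, communicated across workers, and inverted.
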
Citations

KAISA: an adaptive second-order optimizer framework for deep neural networks
TLDR
KAISA is presented: a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts memory footprint, communication, and computation to the given model and hardware to improve performance and increase scalability.
An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
TLDR
This work analyzes the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches in terms of performance and scalability, and concludes that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.
An iterative K-FAC algorithm for Deep Learning
TLDR
The proposed iterative CG-FAC uses the conjugate gradient method to approximate the natural gradient, and it is proved that the time and memory complexity of iterative CG-FAC is much less than that of standard K-FAC.
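The CG-FAC summary above hinges on using conjugate gradient (CG) to apply an approximate inverse Fisher without forming or inverting it explicitly. Below is a minimal, generic NumPy sketch of that idea, solving F v = g given only Fisher-vector products; the function name and the toy explicit Fisher matrix are illustrative assumptions, not code from that paper.

import numpy as np

def conjugate_gradient(fisher_vector_product, g, max_iter=10, tol=1e-10):
    # Solve F v = g for v, given only a function that computes F @ x.
    v = np.zeros_like(g)
    r = g.copy()                 # residual g - F v (v starts at zero)
    p = r.copy()                 # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        v += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return v                     # approximate natural gradient F^{-1} g

# Toy usage with a small explicit (damped) Fisher matrix:
F = np.array([[2.0, 0.3], [0.3, 1.5]])
grad = np.array([1.0, -0.5])
natural_grad = conjugate_gradient(lambda x: F @ x, grad)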
A Trace-restricted Kronecker-Factored Approximation to Natural Gradient
TLDR
A new approximation to the Fisher information matrix (FIM) called Trace-restricted Kronecker-factored Approximate Curvature (TKFAC) is proposed in this work, which preserves a certain trace relationship between the exact and the approximate FIM.
COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression
  • Sian Jin, Chengming Zhang, +5 authors Dingwen Tao. ArXiv, 2021
TLDR
This paper proposes a novel memory-efficient CNN training framework that leverages error-bounded lossy compression to significantly reduce the memory requirement of training, allowing larger models to be trained or training to be accelerated.
Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
  • S. Shi, Lin Zhang, Bo Li. IEEE 41st International Conference on Distributed Computing Systems (ICDCS), 2021
TLDR
This work first characterizes the performance bottlenecks of distributed K-FAC (D-KFAC) and then proposes SPD-KFAC, a D-KFAC variant with smart parallelism of computing and communication tasks, to reduce the iteration time.
Eigenvalue-corrected Natural Gradient Based on a New Approximation
TLDR
The proposed Trace-restricted Eigenvalue-corrected Kronecker Factorization (TEKFAC) corrects the inexact re-scaling factor under the Kronecker-factored eigenbasis, and considers the new approximation method and the effective damping technique proposed in Gao et al. (2020).
Large-Scale Deep Learning Optimizations: A Comprehensive Survey
  • Xiaoxin He, Fuzhao Xue, Xiaozhe Ren, Yang You. ArXiv, 2021
TLDR
This survey aims to provide a clear sketch of optimizations for large-scale deep learning with regard to model accuracy and model efficiency, and investigates the most commonly used optimization algorithms.
Models and Processes to Extract Drug-like Molecules From Natural Language Text
TLDR
An iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers.
GIST: Distributed Training for Large-Scale Graph Convolutional Networks
TLDR
GIST is a hybrid layer and graph sampling method that disjointly partitions the global model into several smaller sub-GCNs trained independently across multiple GPUs in parallel, improving model performance and significantly decreasing wall-clock training time.

References

Showing 1-10 of 43 references
Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks
TLDR
This work proposes an alternative approach using a second-order optimization method that shows generalization capability similar to first-order methods, but converges faster and can handle larger mini-batches.
ImageNet Training in Minutes
TLDR
This paper uses large batch sizes, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources, and empirically evaluates the effectiveness of this approach on two neural networks, AlexNet and ResNet-50, trained on the ImageNet-1k dataset while preserving state-of-the-art test accuracy.
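The summary above names LARS; as a rough reference point, here is a minimal sketch of the commonly described layer-wise trust-ratio scaling (my own illustrative code, not taken from that paper, and implementations differ in where momentum and weight decay enter).

import numpy as np

def lars_step(w, grad, buf, global_lr=1.0, trust_coeff=0.001,
              weight_decay=5e-4, momentum=0.9, eps=1e-12):
    # Layer-wise trust ratio: step size proportional to this layer's weight norm.
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    local_lr = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)
    update = grad + weight_decay * w                       # L2-regularized gradient
    buf = momentum * buf + global_lr * local_lr * update   # heavy-ball momentum
    return w - buf, buf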
Kronecker-factored Curvature Approximations for Recurrent Neural Networks
TLDR
This work extends the K-FAC method to handle RNNs by introducing a novel approximation to the FIM for RNNs, and demonstrates that this method significantly outperforms general-purpose state-of-the-art optimizers like SGD with momentum and Adam on several challenging RNN training tasks.
Large-batch training for LSTM and beyond
TLDR
The Dynamic Adaptive-Tuning Engine (DATE) is proposed for better large-batch training and achieves a 5.3x average speedup over the baselines for four LSTM-based applications on the same hardware.
How to scale distributed deep learning?
TLDR
It is found, perhaps counterintuitively, that asynchronous SGD, including both elastic averaging and gossiping, converges faster with fewer nodes, whereas synchronous SGD scales better to more nodes (up to about 100 nodes).
Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train
TLDR
The challenges and novel solutions needed to train ResNet-50 in this large-scale environment are described, and the novel Collapsed Ensemble (CE) technique is introduced, which allows for 77.5% top-1 accuracy, similar to that of a ResNet-152, while training an unmodified ResNet-50 topology for the same fixed training budget.
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the…
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
TLDR
K-FAC is an efficient method for approximating natural gradient descent in neural networks, based on an efficiently invertible approximation of a neural network's Fisher information matrix that is neither diagonal nor low-rank, and in some cases is completely non-sparse.
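To make the K-FAC entries above concrete, here is a minimal NumPy sketch of the per-layer preconditioning for a fully connected layer (my own illustrative code following the standard formulation sketched after the abstract, not an implementation from any of these papers):

import numpy as np

def kfac_precondition(grad_W, a, g, damping=1e-3):
    # grad_W: (out, in) gradient of the loss w.r.t. the layer's weight matrix
    # a:      (batch, in)  layer inputs (activations)
    # g:      (batch, out) gradients w.r.t. the layer's pre-activation outputs
    batch = a.shape[0]
    A = a.T @ a / batch                   # input covariance (Kronecker factor)
    G = g.T @ g / batch                   # output-gradient covariance (Kronecker factor)
    A += damping * np.eye(A.shape[0])     # Tikhonov damping keeps the factors invertible
    G += damping * np.eye(G.shape[0])
    # (A ⊗ G)^{-1} vec(grad_W) is equivalent to G^{-1} @ grad_W @ A^{-1}
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

# Toy usage:
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 8))      # batch of 32 inputs with 8 features
g = rng.standard_normal((32, 4))      # matching pre-activation gradients, 4 outputs
grad_W = g.T @ a / 32                 # (4, 8) weight gradient
nat_grad_W = kfac_precondition(grad_W, a, g)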