Corpus ID: 202786778

PyTorch: An Imperative Style, High-Performance Deep Learning Library

@inproceedings{Paszke2019PyTorchAI,
  title={PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author={Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K{\"o}pf and Edward Yang and Zach DeVito and Martin Raison and Alykhan Tejani and Sasank Chilamkurthy and Benoit Steiner and Lu Fang and Junjie Bai and Soumith Chintala},
  booktitle={NeurIPS},
  year={2019}
}
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it was designed from first principles to support an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail… 
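The imperative style described above can be illustrated with a minimal sketch (the two-layer model and random batch below are made up for illustration, assuming a standard PyTorch installation): the program executes eagerly, ordinary Python debugging works, and the same code runs on a GPU when one is available.

import torch
import torch.nn as nn

# Use a hardware accelerator if present; fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# An ordinary Python object; each line below executes immediately,
# so print() and pdb can be used anywhere.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 16, device=device)  # made-up input batch
y = torch.randn(8, 1, device=device)   # made-up targets

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()    # reverse-mode automatic differentiation
optimizer.step()
print(loss.item())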

Citations

Project CGX: Scalable Deep Learning on Commodity GPUs
TLDR
This paper investigates whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and proposes CGX, a framework that provides efficient software support for communication compression and is able to remove communication bottlenecks from consumer-grade multi-GPU systems in the absence of hardware support.
Efficient Execution of Quantized Deep Learning Models: A Compiler Approach
TLDR
This paper addresses the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach that introduces a new dialect, Quantized Neural Network (QNN), which extends the compiler's internal representation with a quantization context.
Evaluation of PyTorch as a Data-Parallel Programming API for GPU Volume Rendering
Data-parallel programming (DPP) has attracted considerable interest from the visualization community, fostering major software initiatives such as VTK-m. However, there has been relatively little…
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
TLDR
A performance model based on the concurrency analysis of out-of-core training behavior is proposed, and a strategy that combines layer swapping and redundant recomputing is derived; it can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to…
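KARMA's layer-swapping and recomputation strategy is its own contribution, but the general idea of trading extra compute for accelerator memory can be sketched with PyTorch's built-in activation checkpointing; the deep stack of blocks below is a made-up example, not KARMA's method.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A made-up stack of blocks whose intermediate activations are not stored.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)]
)

def forward(x):
    for block in blocks:
        # Activations are discarded here and recomputed during backward,
        # reducing peak memory at the cost of redundant forward passes.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 256, requires_grad=True)
forward(x).sum().backward()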
DeepCuts: a deep learning optimization framework for versatile GPU workloads
TLDR
Evaluation results with various DL workloads for inference and training indicate that DeepCuts outperforms cuDNN/cuBLAS-based implementations and state-of-the-art DL optimization frameworks such as TVM, TensorFlow XLA, and TensorRT.
OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems
TLDR
OMB-Py, a set of Python extensions to the open-source OSU Micro-Benchmarks (OMB) suite aimed at evaluating the communication performance of MPI-based parallel applications in Python, reveals that mpi4py introduces only a small overhead compared to native MPI libraries.
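The real OMB-Py benchmarks are more elaborate, but the flavour of a point-to-point latency test in mpi4py can be sketched as below. This is a minimal ping-pong sketch, not the actual OMB-Py code; the message size and iteration count are arbitrary, and it would be launched with two ranks, e.g. mpirun -np 2 python latency.py.

# latency.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

size = 1024            # message size in bytes (arbitrary)
iters = 1000
buf = np.zeros(size, dtype=np.uint8)

comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration is a round trip (two messages), so halve it for latency.
    print(f"{size} bytes: {elapsed / iters / 2 * 1e6:.2f} us")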
MIOpen: An Open Source Library For Deep Learning Primitives
  • Jehandad Khan, Paul Fultz, +12 authors Mayank Daga
  • Computer Science, Mathematics
    Proceedings of the 30th International Conference on Computer Graphics and Machine Vision (GraphiCon 2020). Part 2
  • 2020
TLDR
This paper introduces MIOpen and provides details about the internal workings of the library and its supported features, including fusion to reduce memory-bandwidth and GPU kernel-launch overheads, and different convolution algorithms optimized for different filter and input sizes.
Transparent acceleration of Java-based deep learning engines
TLDR
TornadoVM, a state-of-the-art heterogeneous programming framework, is employed to transparently accelerate Deep Netts on heterogeneous hardware, showing how a pure Java-based deep learning neural network engine can be dynamically compiled at runtime and specialized for particular hardware accelerators, without requiring developers to use the low-level programming frameworks typically needed for such devices.
SOL: Effortless Device Support for AI Frameworks without Source Code Changes
  • Nicolas Weber, Felipe Huici
  • Computer Science
    2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
  • 2020
TLDR
SOL, an AI acceleration middleware that provides a hardware abstraction layer allowing AI frameworks to transparently support heterogeneous hardware without changes to the framework's source code, thereby minimizing maintenance overhead, is introduced.

References

Showing 1-10 of 33 references
cuDNN: Efficient Primitives for Deep Learning
TLDR
A library similar in intent to BLAS, with optimized routines for deep learning workloads; it currently contains routines for GPUs and, like the BLAS library, could be implemented for other platforms.
Automatic differentiation in PyTorch
TLDR
The automatic differentiation module of PyTorch is described: a library designed to enable rapid research on machine learning models that differentiates purely imperative programs, with an emphasis on extensibility and low overhead.
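For readers unfamiliar with the module, differentiating an ordinary imperative program looks roughly like this (a toy scalar function made up for illustration; plain Python control flow is recorded as it runs):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()   # executes eagerly
if y > 5:            # ordinary Python branching participates in the trace
    y = y * 3

y.backward()         # reverse-mode AD over the operations that actually ran
print(x.grad)        # d(3 * sum(x^2))/dx = 6 * x -> tensor([12., 18.])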
Theano: A Python framework for fast computation of mathematical expressions
TLDR
The performance of Theano is compared against Torch7 and TensorFlow on several machine learning models and recently-introduced functionalities and improvements are discussed.
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
TLDR
This work aims to show, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD!, which allows processors to access shared memory with the possibility of overwriting each other's work.
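PyTorch's shared-memory tensors make it easy to sketch a HOGWILD!-style setup, with several worker processes updating one set of parameters without locks; the toy model, data, and worker count below are made up and do not reproduce the paper's experiments.

import torch
import torch.nn as nn
import torch.multiprocessing as mp

def train(model):
    # Each worker updates the shared parameters without any locking;
    # occasional overwrites are tolerated, as in HOGWILD!.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(16, 8)   # made-up data
        y = torch.randn(16, 1)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    model = nn.Linear(8, 1)
    model.share_memory()   # put the parameters in shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()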
Caffe: Convolutional Architecture for Fast Feature Embedding
TLDR
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
CNTK: Microsoft's Open-Source Deep-Learning Toolkit
TLDR
This tutorial will introduce the Computational Network Toolkit, or CNTK, Microsoft's cutting-edge open-source deep-learning toolkit for Windows and Linux, and show what typical use looks like for relevant tasks such as image recognition, sequence-to-sequence modeling, and speech recognition.
Torch7: A Matlab-like Environment for Machine Learning
TLDR
Torch7 is a versatile numeric computing framework and machine learning library that extends Lua and can easily be interfaced to third-party software thanks to Lua's light interface.
DyNet: The Dynamic Neural Network Toolkit
TLDR
DyNet is a toolkit for implementing neural network models based on dynamic declaration of network structure that has an optimized C++ backend and lightweight graph representation and is designed to allow users to implement their models in a way that is idiomatic in their preferred programming language.
maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs
This paper describes maxDNN, a computationally efficient convolution kernel for deep learning with the NVIDIA Maxwell GPU. maxDNN reaches 96.3% computational efficiency on typical deep learning…
Hoard: a scalable memory allocator for multithreaded applications
TLDR
Hoard is the first allocator to simultaneously solve the above problems, and combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case.