• Corpus ID: 202786778

PyTorch: An Imperative Style, High-Performance Deep Learning Library

@inproceedings{paszke2019pytorch,
  title={PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author={Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K{\"o}pf and Edward Yang and Zach DeVito and Martin Raison and Alykhan Tejani and Sasank Chilamkurthy and Benoit Steiner and Lu Fang and Junjie Bai and Soumith Chintala},
  booktitle={Advances in Neural Information Processing Systems 32},
  year={2019}
}
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it was designed from first principles to support an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail… 
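The imperative, define-by-run style the abstract describes can be illustrated with a small pure-Python sketch (a pedagogical stand-in, not PyTorch code; the `Tensor` class below is hypothetical): every operation executes immediately, so the model is ordinary Python code whose intermediate values can be inspected with plain `print` debugging.

```python
# Minimal eager "tensor": each operation runs as soon as it is called,
# so models are written and debugged like any other Python program.
class Tensor:
    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        return Tensor(a + b for a, b in zip(self.data, other.data))

    def relu(self):
        return Tensor(max(x, 0.0) for x in self.data)


def model(x, depth):
    # Ordinary Python control flow decides the computation on the fly;
    # there is no separate graph-compilation step.
    for _ in range(depth):
        x = (x + Tensor([1.0] * len(x.data))).relu()
    return x


out = model(Tensor([-3.0, 0.5]), depth=2)
print(out.data)  # -> [1.0, 2.5]; any intermediate x is inspectable too
```

A define-and-compile framework would instead build a symbolic graph first, making this kind of step-through inspection indirect.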

Figures and Tables from this paper

VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware

This work proposes VirtualFlow, a system leveraging a novel abstraction called virtual node processing to decouple the model from the hardware, which enables many new use cases, such as reproducing training results across different hardware, resource elasticity, and heterogeneous training.

Efficient Execution of Quantized Deep Learning Models: A Compiler Approach

This paper addresses the challenges of executing quantized deep learning models on diverse hardware platforms with an augmented compiler approach: a new dialect called Quantized Neural Network (QNN) extends the compiler's internal representation with a quantization context.

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

This paper discusses and compares current state-of-the-art frameworks for large scale distributed deep learning and identifies the different types of parallelism used and presents empirical results comparing their performance on large image and language training tasks.

Evaluation of PyTorch as a Data-Parallel Programming API for GPU Volume Rendering

It is found that most relevant DPP primitives exhibit performance similar to a native CUDA library; however, PyTorch is limited in expressiveness compared to other DPP APIs, and while render times are sufficient for an early “proof of concept”, memory usage acutely limits scalability.

Flexible Performant GEMM Kernels on GPUs

This paper presents three sets of abstractions and interfaces for programming GEMMs in the Julia scientific programming language, and demonstrates that their performance is in the same ballpark as the vendor libraries, and in some cases even exceeds it, without writing a single line of CUDA C++ or assembly and without facing flexibility limitations.

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

  • M. Wahib, Haoyu Zhang, S. Matsuoka
  • Computer Science
    SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2020
A performance model based on the concurrency analysis of out-of-core training behavior, and a strategy that combines layer swapping and redundant recomputation, are proposed and shown to outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.

DeepCuts: a deep learning optimization framework for versatile GPU workloads

The evaluation result with various DL workloads for inference and training indicates that DeepCuts outperforms cuDNN/cuBLAS-based implementations and the state-of-the-art DL optimization frameworks, such as TVM, TensorFlow XLA, and TensorRT.

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

This work proposes an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads) based on Polygeist/MLIR, and includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and enables parallelism-specific optimizations.

MIOpen: An Open Source Library For Deep Learning Primitives

  • Jehandad Khan, Paul Fultz, Mayank Daga
  • Computer Science
    Proceedings of the 30th International Conference on Computer Graphics and Machine Vision (GraphiCon 2020). Part 2
  • 2020
This paper introduces MIOpen and details the internal workings of the library and its supported features, including fusion to optimize for memory bandwidth and GPU launch overheads, and different algorithms to optimize convolutions for different filter and input sizes.

Transparent acceleration of Java-based deep learning engines

TornadoVM is employed, a state-of-the-art heterogeneous programming framework, to transparently accelerate Deep Netts on heterogeneous hardware, showing how a pure Java-based deep learning neural network engine can be dynamically compiled at runtime and specialized for particular hardware accelerators, without requiring developers to employ any low-level programming framework typically used for such devices.



cuDNN: Efficient Primitives for Deep Learning

A library similar in intent to BLAS, with optimized routines for deep learning workloads on GPUs; like the BLAS library, it could be implemented for other platforms.

Automatic differentiation in PyTorch

An automatic differentiation module of PyTorch is described: a library designed to enable rapid research on machine learning models, built around differentiation of purely imperative programs, with a focus on extensibility and low overhead.
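Differentiation of a purely imperative program can be sketched with a tiny tape-based reverse-mode autodiff for scalars (an illustration of the technique, not PyTorch's autograd): each operation records its inputs and local derivatives as it executes, and gradients are then propagated back through the recorded graph.

```python
# Tiny reverse-mode autodiff: operations record (input, local-gradient)
# pairs as the program runs imperatively; backward() replays the tape.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)  # (input Var, d_output/d_input) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])


def backward(out):
    # Topologically order the recorded graph, then accumulate gradients
    # from the output back to every input.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x = Var(3.0)
y = Var(2.0)
z = x * y + x          # z = x*y + x, so dz/dx = y + 1, dz/dy = x
backward(z)
print(x.grad, y.grad)  # -> 3.0 3.0
```

PyTorch's actual implementation adds, among much else, a C++ backend and in-place-operation handling; the tape-and-replay structure is the shared idea.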

Theano: A Python framework for fast computation of mathematical expressions

The performance of Theano is compared against Torch7 and TensorFlow on several machine learning models and recently-introduced functionalities and improvements are discussed.

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

This work shows, through novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! which allows processors to access shared memory with the possibility of overwriting each other's work.
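The access pattern HOGWILD! relies on can be sketched in a few lines of Python (an illustration of the idea, not the paper's implementation): worker threads apply sparse SGD updates to a shared parameter vector with no locking, tolerating occasional overwrites because sparse updates rarely touch the same coordinate.

```python
# HOGWILD!-style sketch: threads update shared parameters lock-free.
import random
import threading

random.seed(0)
params = [0.0] * 8  # shared model state, updated with no lock
# Toy task: each example is (coordinate index, target value 1.0).
data = [(random.randrange(8), 1.0) for _ in range(2000)]


def worker(shard):
    lr = 0.1
    for idx, target in shard:
        # Sparse SGD step on squared error, touching one coordinate only.
        grad = 2.0 * (params[idx] - target)
        params[idx] -= lr * grad  # racy read-modify-write, by design


threads = [threading.Thread(target=worker, args=(data[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(params)  # each coordinate converges toward its target of 1.0
```

Note that CPython's GIL serializes bytecode, so this only demonstrates the lock-free access pattern; the paper's analysis concerns truly concurrent processors.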

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

Torch7: A Matlab-like Environment for Machine Learning

Torch7 is a versatile numeric computing framework and machine learning library that extends Lua; it can easily be interfaced to third-party software thanks to Lua’s light interface.

DyNet: The Dynamic Neural Network Toolkit

DyNet is a toolkit for implementing neural network models based on dynamic declaration of network structure that has an optimized C++ backend and lightweight graph representation and is designed to allow users to implement their models in a way that is idiomatic in their preferred programming language.

maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs

This paper describes maxDNN, a computationally efficient convolution kernel for deep learning with the NVIDIA Maxwell GPU. maxDNN reaches 96.3% computational efficiency on typical deep learning workloads.

Hoard: a scalable memory allocator for multithreaded applications

Hoard is the first allocator to simultaneously solve these problems (allocator contention, false sharing, and memory blowup): it combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case.
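The global-heap/per-processor-heap discipline can be sketched as follows (a toy model in which threads stand in for processors and integer tokens for memory blocks; not Hoard's actual code): most allocations hit a thread-local free list, and the global lock is taken only to move blocks in batches between local heaps and the global heap.

```python
# Sketch of Hoard's core structure: per-thread heaps serve the common
# case without contention; a global heap rebalances memory between them.
import threading

GLOBAL_HEAP = list(range(64))  # free blocks available to all threads
GLOBAL_LOCK = threading.Lock()
_local = threading.local()


def _my_heap():
    if not hasattr(_local, "free"):
        _local.free = []
    return _local.free


def alloc():
    heap = _my_heap()
    if not heap:
        # Refill in a batch so the global lock is taken rarely.
        # (A real allocator would request memory from the OS when the
        # global heap runs dry; this sketch would just raise IndexError.)
        with GLOBAL_LOCK:
            heap.extend(GLOBAL_HEAP[-8:])
            del GLOBAL_HEAP[-8:]
    return heap.pop()


def free(block):
    heap = _my_heap()
    heap.append(block)
    if len(heap) > 16:
        # Return surplus blocks to the global heap so no thread hoards
        # memory, bounding per-thread consumption as in Hoard's discipline.
        with GLOBAL_LOCK:
            GLOBAL_HEAP.extend(heap[8:])
            del heap[8:]
```

The batch thresholds (8 and 16) are arbitrary here; Hoard's actual invariants tie them to heap fullness to get its provable bound on blowup.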

Julia: A Fresh Approach to Numerical Computing

The Julia programming language and its design are introduced: a dance between specialization and abstraction that recognizes what remains the same after a computation and what is best left untouched, having been built by experts.