Caffe con Troll: Shallow Ideas to Speed Up Deep Learning

Firas Abuzaid, Stefan Hadjis, Ce Zhang, Christopher Ré. Proceedings of the Fourth Workshop on Data analytics in the Cloud.
We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 6.3× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the…
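
The abstract does not spell out what the batching optimizations are. As a rough illustration only, the sketch below assumes one common form: every image in a mini-batch is unrolled ("lowered") into rows of patches, the rows are stacked into one tall matrix, and the whole batch is convolved with a single matrix multiply rather than one multiply per image. The function names (`lower_image`, `matmul`, `conv_batched`) are hypothetical and are not CcT's or Caffe's API; the pure-Python `matmul` stands in for a BLAS GEMM call, and the operation is the cross-correlation that CNN layers conventionally call convolution.

```python
# Illustrative sketch under stated assumptions; not code from CcT or Caffe.

def lower_image(image, k):
    """Unroll every k x k patch of a 2-D image into one row (im2col-style)."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows


def matmul(a, b):
    """Plain matrix multiply; stands in for a single BLAS GEMM call."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def conv_batched(images, kernels):
    """Convolve a whole batch with one matrix multiply.

    images:  list of equally sized 2-D lists (single channel)
    kernels: list of k x k 2-D lists
    Returns out[n][f]: flattened output map of image n under kernel f.
    """
    k = len(kernels[0])
    # Stack the lowered patches of every image into one tall matrix.
    lowered = [row for img in images for row in lower_image(img, k)]
    # One column per kernel, flattened in the same order as the patches.
    weights = [[kern[di][dj] for kern in kernels]
               for di in range(k) for dj in range(k)]
    product = matmul(lowered, weights)  # one large multiply for the batch
    per_image = len(lowered) // len(images)
    out = []
    for n in range(len(images)):
        block = product[n * per_image:(n + 1) * per_image]
        out.append([[row[f] for row in block] for f in range(len(kernels))])
    return out


if __name__ == "__main__":
    batch = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
             [[1, 0, 0], [0, 1, 0], [0, 0, 1]]]
    filters = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
    print(conv_batched(batch, filters))
```

The point of the design is that a larger matrix gives the multiply routine more work per call; whether that matches the exact optimization measured in the paper is an assumption of this sketch.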

swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight
To explore the potential of training complex deep neural networks (DNNs) on other commercial chips rather than GPUs, we report our work on swDNN, which is a highly-efficient library for accelerating
Extending Caffe for Machine Learning of Large Neural Networks Distributed on GPUs
This paper extends Caffe to allow the use of more than 12 GB of GPU memory and runs training experiments to determine the learning efficiency of object detection neural net software on a PC with three GPUs.
A Systematic Approach to Blocking Convolutional Neural Networks
This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests that automatically derives optimized blockings for common networks, improving the energy efficiency of custom hardware implementations by up to an order of magnitude.
Deep Learning Approximation: Zero-Shot Neural Network Speedup
This work proposes a technique called Deep Learning Approximation to build a faster network in a tiny fraction of the time required for training by manipulating only the network structure and coefficients, without requiring re-training or access to the training data.
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
This work uses a novel understanding of the interaction between system and optimization dynamics to provide an efficient hyperparameter optimizer, demonstrating that the most popular distributed deep learning systems fall within the tradeoff space but do not optimize within it.
Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs
This work proposes Escort, an efficient approach to sparse convolutional neural networks on GPUs that orchestrates parallelism and locality for the direct sparse convolution kernel and applies customized optimization techniques to further improve performance.
Evaluating the Energy Efficiency of Deep Convolutional Neural Networks on CPUs and GPUs
Da Li, Xinbo Chen, M. Becchi, Ziliang Zong. Computer Science. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), 2016.
This paper conducts a comprehensive study on the power behavior and energy efficiency of numerous well-known CNNs and training frameworks on CPUs and GPUs, and provides a detailed workload characterization to facilitate the design of energy efficient deep learning solutions.
CoIn: Accelerated CNN Co-Inference through data partitioning on heterogeneous devices
This work presents a method (CoIn) that uses multiple devices executing simultaneously and achieves low inference time by partitioning the images of a batch across diverse micro-architectures.
Optimizing CNNs on Multicores for Scalability, Performance and Goodput
An automatic framework called spg-CNN is presented for optimizing CNN training on CPUs. It comprises a computation scheduler for efficient parallel execution and two code generators: one that optimizes for sparsity and one that optimizes for spatial reuse in convolutions.
Distributed Training Large-Scale Deep Architectures
This paper develops a procedure for setting minibatch size and choosing computation algorithms and derives lemmas for determining the quantity of key components such as the number of GPUs and parameter servers for large-scale deep learning training.


Caffe: Convolutional Architecture for Fast Feature Embedding
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Dogwild! – Distributed Hogwild for CPU & GPU
This work describes a set of extensions to the state-of-the-art Caffe library that allow training on multiple threads and GPUs and across multiple machines, and shows linear performance scaling for small clusters on MNIST along with early results on ImageNet.
Project Adam: Building an Efficient and Scalable Deep Learning Training System
This paper describes the design and implementation of Adam, a distributed system comprised of commodity server machines for training large deep neural network models. The system exhibits world-class performance, scaling, and task accuracy on visual recognition tasks, and the paper shows that task accuracy improves with larger models.
Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
High Performance Convolutional Neural Networks for Document Processing
Three novel approaches to speeding up CNNs are presented: a) unrolling convolution, b) using BLAS (basic linear algebra subroutines), and c) using GPUs (graphic processing units).
cuDNN: Efficient Primitives for Deep Learning
This paper presents a library similar in intent to BLAS, with optimized routines for deep learning workloads. The implementation contains routines for GPUs and, like BLAS, could be implemented for other platforms.
SINGA: Putting Deep Learning in the Hands of Multimedia Users
This paper designs a distributed deep learning platform called SINGA, which has an intuitive programming model and good scalability. Experience with developing and training deep learning models for real-life multimedia applications in SINGA shows that the platform is both usable and scalable.
Text Understanding from Scratch
It is shown that temporal ConvNets can achieve astonishing performance without knowledge of words, phrases, sentences, or any other syntactic or semantic structures of a human language.