• Corpus ID: 21168523

MEC: Memory-efficient Convolution for Deep Neural Network

  • Minsik Cho, Daniel Brand
Convolution is a critical component in modern deep neural networks, and several algorithms for convolution have been developed. Key Method: MEC lowers the input matrix in a simple yet efficient and compact way (i.e., with much less memory overhead), and then executes multiple small matrix multiplications in parallel to complete the convolution. Additionally, the reduced memory footprint improves memory sub-system efficiency, which in turn improves performance. Our experimental results show that MEC reduces memory consumption…
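For the single-channel, stride-1 case, the lowering idea described above can be sketched in NumPy as follows. This is a simplified illustration under those assumptions, not the authors' tuned implementation; the function name `mec_conv2d` is chosen here:

```python
import numpy as np

def mec_conv2d(inp, kernel):
    """MEC-style 2-D convolution (single channel, stride 1):
    lower along the width only, then run one small GEMM per output row."""
    H, W = inp.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Lowered matrix: one row per output column, each holding a
    # kw-wide vertical strip of the input (H*kw values).
    L = np.empty((out_w, H * kw))
    for w in range(out_w):
        L[w] = inp[:, w:w + kw].ravel()
    kflat = kernel.ravel()
    out = np.empty((out_h, out_w))
    for h in range(out_h):
        # Each output row reuses a sliding window of L's columns,
        # so the small multiplications share the lowered data.
        out[h] = L[:, h * kw:(h + kh) * kw] @ kflat
    return out
```

The lowered matrix holds `out_w` rows of `H*kw` values, versus the `out_h*out_w` rows of `kh*kw` values that classic im2col materializes, which is where the memory saving comes from.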


High Performance Zero-Memory Overhead Direct Convolutions

This paper demonstrates that direct convolution, when implemented correctly, eliminates all memory overhead and yields performance that is between 10% and 400% better than existing high-performance implementations of convolution layers on conventional and embedded CPU architectures.
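In its simplest form, a zero-workspace direct convolution reduces to accumulating shifted input slices. A minimal single-channel, stride-1 sketch follows; the paper's actual contribution is a carefully blocked and vectorized loop nest, which this does not reproduce:

```python
import numpy as np

def direct_conv2d(inp, kernel):
    """Direct 2-D convolution: no im2col workspace, no data duplication."""
    H, W = inp.shape
    kh, kw = kernel.shape
    oh, ow = H - kh + 1, W - kw + 1
    out = np.zeros((oh, ow))
    # Accumulate one shifted input slice per kernel tap.
    for r in range(kh):
        for c in range(kw):
            out += kernel[r, c] * inp[r:r + oh, c:c + ow]
    return out
```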

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels

A new kernel-fusion technique for fast convolution algorithms based on MegaKernels is proposed, which achieves average speedups of 1.25X and 1.7X over cuDNN's two implementations of the Winograd convolution algorithm.

Parallel convolution algorithm using implicit matrix multiplication on multi-core CPUs

This paper presents a new parallel convolution algorithm using implicit matrix multiplication on multi-core CPUs and shows that the new algorithm gives much better performance and scalability than the im2col+GEMM method in most cases.
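The im2col+GEMM baseline that these papers compare against can be sketched as follows (single channel, stride 1; `im2col_conv2d` is a name chosen here):

```python
import numpy as np

def im2col_conv2d(inp, kernel):
    """Classic im2col + GEMM: copy every kh*kw patch into a row of a
    large matrix, then do a single matrix multiplication."""
    H, W = inp.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Patch matrix: (out_h*out_w) rows, each a flattened kh*kw patch.
    # This kh*kw-fold data duplication is the memory overhead that
    # MEC and the low-memory variants above try to avoid.
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = inp[i:i + kh, j:j + kw].ravel()
    return (cols @ kernel.ravel()).reshape(out_h, out_w)
```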

Optimizing GPU Memory Transactions for Convolution Operations

This paper presents a novel approach to optimizing memory access for convolution operations, specifically targeting GPU execution, that leverages two optimization techniques to reduce the number of memory operations for convolutions performed along the width and height dimensions.

High-Performance Low-Memory Lowering: GEMM-based Algorithms for DNN Convolution

It is shown that the im2col algorithm is just one point in a regular design space of algorithms which translate convolution to GEMM, and several novel low-memory algorithms which match the performance of the best known approaches despite requiring only a small fraction of the additional memory.

mGEMM: low-latency convolution with minimal memory overhead optimized for mobile devices

The convolution layer is the key building block in many neural network designs. Most high-performance implementations of the convolution operation rely on GEMM (General Matrix Multiplication) to…

Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores

A GPU architecture named Duplo is introduced that minimizes redundant memory accesses of convolutions in deep neural networks (DNNs) by leveraging compile-time information and microarchitectural support to detect and eliminate redundant memory accesses that repeatedly load duplicates of data in the workspace matrix.

Low-memory GEMM-based convolution algorithms for deep neural networks

Two novel GEMM-based algorithms are presented that require just a fraction of the additional memory for DNN convolution, making them much more suitable for memory-limited embedded systems.

The Indirect Convolution Algorithm

The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of the im2col transformation. It introduces an indirection buffer, a buffer of pointers to the start of each row of image pixels, which broadens the application of the modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation.
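In spirit, the indirection buffer replaces patch copies with a table of row pointers that the GEMM gathers input through. A rough single-channel, stride-1 Python sketch of the idea, using row indices in place of raw pointers (not the actual C implementation):

```python
import numpy as np

def indirect_conv2d(inp, kernel):
    """Indirect-convolution sketch: instead of copying patches (im2col),
    build a small buffer of row indices ("pointers") and gather input
    rows through it at compute time. Stride, padding, and dilation
    would only change how the index buffer is filled."""
    H, W = inp.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Indirection buffer: for each output row, the kh input rows it reads.
    row_idx = np.array([[h + r for r in range(kh)] for h in range(out_h)])
    out = np.zeros((out_h, out_w))
    for h in range(out_h):
        for r, src in enumerate(row_idx[h]):
            # 1-D correlation of the gathered input row with kernel row r.
            for c in range(kw):
                out[h] += kernel[r, c] * inp[src, c:c + out_w]
    return out
```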

Crossbar-Aware Neural Network Pruning

The proposed crossbar-aware pruning framework is able to reduce the resource overhead and the related energy cost and provides a new co-design solution for mapping CNNs onto various crossbar devices with much better efficiency.

Zero and data reuse-aware fast convolution for deep neural networks on GPU

This work proposes a low-overhead, efficient hardware mechanism that skips multiplications guaranteed to produce zero regardless of the input data, and presents a data-reuse optimization for the addition operations in Winograd convolution (called AddOpt), which improves the utilization of local registers and thereby reduces on-chip cache accesses.

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

This work studies the memory efficiency of various CNN layers and reveals the performance implications of both data layouts and memory access patterns, showing the universal effect of the proposed optimizations on both single layers and complete networks.

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

A simple and effective scheme to compress an entire CNN, called one-shot whole-network compression, which also addresses an important implementation-level issue with 1×1 convolution, a key operation in the inception modules of GoogLeNet as well as in CNNs compressed by the proposed scheme.

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution…

Accelerating Convolutional Neural Networks for Mobile Applications

An efficient and effective approach is proposed to accelerate the test-phase computation of CNNs based on low-rank and group sparse tensor decomposition, which achieves significant reduction in computational complexity, at the cost of negligible loss in accuracy.

Speeding up Convolutional Neural Networks with Low Rank Expansions

Two simple schemes for drastically speeding up convolutional neural networks are presented, achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain.
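The rank-1 spatial-domain idea can be illustrated with an SVD-based separable approximation: a k×k filter is replaced by a k×1 pass followed by a 1×k pass. This is a generic sketch of the concept, not the paper's exact optimization procedure:

```python
import numpy as np

def separable_approx(kernel):
    """Best rank-1 (spatially separable) approximation of a 2-D filter
    via SVD; exact when the filter is already separable."""
    u, s, vt = np.linalg.svd(kernel)
    col = u[:, 0] * np.sqrt(s[0])   # vertical 1-D filter (k x 1)
    row = vt[0] * np.sqrt(s[0])     # horizontal 1-D filter (1 x k)
    return col, row
```

For a k×k filter this cuts the per-pixel multiply count from k² to 2k, at the cost of an approximation error governed by the discarded singular values.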

Fast Training of Convolutional Networks through FFTs

This work presents a simple algorithm which accelerates training and inference by a significant factor, and can yield improvements of over an order of magnitude compared to existing state-of-the-art implementations.
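The underlying principle, convolution as a pointwise product in the frequency domain, can be sketched with NumPy's FFT routines. The sketch below computes a "valid" cross-correlation as used by CNN layers; it illustrates the mathematics only, not the paper's GPU implementation:

```python
import numpy as np

def fft_conv2d(inp, kernel):
    """'Valid' cross-correlation via FFT: a pointwise product in the
    frequency domain replaces the spatial sliding window, which pays
    off for large kernels."""
    H, W = inp.shape
    kh, kw = kernel.shape
    # Cross-correlation = convolution with a flipped kernel;
    # zero-pad both operands to the input size.
    kf = np.fft.rfft2(kernel[::-1, ::-1], s=(H, W))
    xf = np.fft.rfft2(inp, s=(H, W))
    full = np.fft.irfft2(xf * kf, s=(H, W))
    # Keep only the fully-overlapping ('valid') region.
    return full[kh - 1:, kw - 1:]
```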

Fast Algorithms for Convolutional Neural Networks

  • Andrew Lavin, Scott Gray
  • Computer Science
  • 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2016
A new class of fast algorithms for convolutional neural networks is introduced based on Winograd's minimal filtering algorithms, which compute minimal-complexity convolution over small tiles, making them fast for small filters and small batch sizes.
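The smallest such algorithm, F(2,3), computes two outputs of a 3-tap filter with four multiplications instead of six; the transforms below follow the paper's 1-D formulation:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap filter
    g from a 4-element input tile d, using 4 multiplications."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])
```

The filter-side transforms (the two halved sums) can be precomputed once per filter, so at inference time only the four multiplies and a few additions remain per tile.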

High Performance Convolutional Neural Networks for Document Processing

Three novel approaches to speeding up CNNs are presented: a) unrolling convolution, b) using BLAS (basic linear algebra subroutines), and c) using GPUs (graphic processing units).

cuDNN: Efficient Primitives for Deep Learning

A library similar in intent to BLAS, with optimized routines for deep learning workloads. It contains routines for GPUs and, like the BLAS library, could be implemented for other platforms.