• Corpus ID: 15196840

Improving the speed of neural networks on CPUs

Vincent Vanhoucke, Andrew W. Senior, Mark Z. Mao
Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions, which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid…
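The fixed-point idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes symmetric linear quantization of both weights and activations to signed 8-bit integers, with the accumulation done entirely in integers and a single rescale at the end. The quantization scheme and the names `quantize` and `quantized_dot` are illustrative assumptions, not taken from the paper, and the pure-Python loop stands in for what SIMD instructions would do on several values at once.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with one shared scale (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = qmax / max_abs
    return [int(round(v * scale)) for v in values], scale


def quantized_dot(weights, activations):
    """Dot product with integer multiply-accumulate and one final rescale."""
    q_w, scale_w = quantize(weights)
    q_a, scale_a = quantize(activations)
    acc = sum(w * a for w, a in zip(q_w, q_a))    # integer-only accumulation
    return acc / (scale_w * scale_a)              # back to floating point


if __name__ == "__main__":
    weights = [0.5, -0.25, 1.0]
    activations = [1.0, 0.5, 0.25]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("float:", exact, "quantized:", quantized_dot(weights, activations))
```

The quantized result differs from the floating-point one only by rounding error, which is the trade-off that makes narrow integer arithmetic attractive when several 8-bit values fit into the register space of one 32-bit float.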

Comparing deep learning performance on BigData by using CPUs and GPUs

This paper applies a deep learning approach to processing big data for a specific problem on a multi-core platform, and shows that using GPUs increases system performance by up to 10 times, depending on the type of GPU.

Using software optimization techniques and exploiting hardware capabilities to speed-up BLSTM neural network on CPUs

This work introduces and demonstrates the efficacy of many software optimization techniques that allow for neural networks to fully benefit from the capabilities of CPUs without compromising their accuracy, and evaluates the proposed optimization techniques using a Bidirectional Long Short-Term Memory neural network to solve an Optical Character Recognition (OCR) problem.

Design and Optimization of Hardware Accelerators for Deep Learning

This dissertation proposes two hardware units, ISAAC and Newton, and shows that in-situ computing designs can outperform DNN digital accelerators, if they leverage pipelining, smart encodings, and can distribute a computation in time and space, within crossbars, and across crossbars.

Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design

The authors' virtualized DNN (vDNN) reduces the average memory usage of AlexNet by 61% and OverFeat by 83%, a significant reduction in memory requirements of DNNs.

vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction

A Survey on Methods and Theories of Quantized Neural Networks

A thorough review is given of different aspects of quantized neural networks, which are recognized as one of the most effective approaches to satisfying the extreme memory requirements that deep neural network models demand.

On the quantization of recurrent neural networks

This work presents an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies, which themselves are the foundation of many production ML systems.

8-Bit Approximations for Parallelism in Deep Learning

8-bit approximation is an efficient method to parallelize convolutional networks on very large systems of GPUs and achieves state-of-the-art speedups for model parallelism.

Transfer Learning with Binary Neural Networks

It is shown that a single binary neural network trained on the ImageNet dataset can be used as a feature extractor for other datasets, and a transfer-learning-based architecture built on it is proposed.

DaDianNao: A Neural Network Supercomputer

A custom multi-chip machine-learning architecture is introduced, containing a combination of custom storage and computational units, with separate electrical and optical inter-chip interconnects. It is shown that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 656.63× over a GPU and reduce the energy by 184.05× on average for a 64-chip system.

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Neural Network Implementation Using CUDA and OpenMP

This paper proposes a quicker and more efficient implementation of neural networks on both GPU and multi-core CPU, and uses CUDA (Compute Unified Device Architecture), which can be easily programmed due to its simple C-like style, to solve the first problem.

Large-scale deep unsupervised learning using graphics processors

It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.

GPU implementation of neural networks

Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

This paper reports results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously, and outperforms the best Gaussian Mixture Model Hidden Markov Model baseline.

Faster matrix-vector multiplication on GeForce 8800GTX

  • N. Fujimoto
  • Computer Science
    2008 IEEE International Symposium on Parallel and Distributed Processing
  • 2008
The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.

CUDAMat: a CUDA-based matrix class for Python

The feature set of CUDAMat is biased towards providing functionality useful for implementing standard machine learning algorithms; however, it is general enough to be useful in other fields.

Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs

  • K. Knill, M. Gales, S. Young
  • Computer Science
    Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
  • 1996
This paper investigates the use of Gaussian Selection to reduce the state likelihood computation in HMM-based systems and examines the trade-offs necessary between achieving good state likelihoods and low computation.

The bucket box intersection (BBI) algorithm for fast approximative evaluation of diagonal mixture Gaussians

  • J. Fritsch, I. Rogina
  • Computer Science
    1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
  • 1996
This paper presents an alternative approach to approximate mixture Gaussians with diagonal covariance matrices, based on a binary feature space partitioning tree, which achieves a speedup of 2-5 in the computation of HMM emission probabilities, without affecting the accuracy of the system.