# Improving the speed of neural networks on CPUs

@inproceedings{Vanhoucke2011ImprovingTS, title={Improving the speed of neural networks on CPUs}, author={Vincent Vanhoucke and Andrew W. Senior and Mark Z. Mao}, year={2011} }

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. […] We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid…
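To make the fixed-point idea concrete, here is a minimal sketch of an 8-bit quantized dot product in plain Python. It is not the paper's implementation: the symmetric per-vector scaling and the helper names `quantize` and `quantized_dot` are assumptions chosen for illustration, and a real implementation would perform the integer multiply-accumulate with SIMD instructions rather than a Python loop.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers sharing one scale (symmetric linear quantization).

    Hypothetical helper for illustration; the scaling scheme is an assumption.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantized_dot(weights, activations):
    """Approximate a float dot product using integer multiply-accumulate."""
    q_weights, weight_scale = quantize(weights)
    q_activations, activation_scale = quantize(activations)
    # All arithmetic in this sum is on integers; this is the part that
    # fixed-point SIMD instructions would accelerate on a CPU.
    accumulator = sum(w * a for w, a in zip(q_weights, q_activations))
    # A single floating-point rescale recovers the approximate result.
    return accumulator * weight_scale * activation_scale


weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 0.5, -2.0, 0.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f} error={abs(exact - approx):.4f}")
```

The point of the sketch is that the inner loop touches only small integers, so several values fit in one vector register, while the floating-point work shrinks to one rescale per output.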

## 773 Citations

### Comparing deep learning performance on BigData by using CPUs and GPUs

- Computer Science
- 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT)
- 2018

This paper uses a deep learning approach to process big data for a specific problem on a multi-core platform, and shows that using GPUs increases system performance by up to 10 times, depending on the type of GPU.

### Using software optimization techniques and exploiting hardware capabilities to speed-up BLSTM neural network on CPUs

- Computer Science
- 2017

This work introduces and demonstrates the efficacy of many software optimization techniques that allow for neural networks to fully benefit from the capabilities of CPUs without compromising their accuracy, and evaluates the proposed optimization techniques using a Bidirectional Long Short-Term Memory neural network to solve an Optical Character Recognition (OCR) problem.

### Design and Optimization of Hardware Accelerators for Deep Learning

- Computer Science
- 2018

This dissertation proposes two hardware units, ISAAC and Newton, and shows that in-situ computing designs can outperform DNN digital accelerators, if they leverage pipelining, smart encodings, and can distribute a computation in time and space, within crossbars, and across crossbars.

### Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design

- Computer Science
- ArXiv
- 2016

The authors' virtualized DNN (vDNN) reduces the average memory usage of AlexNet by 61% and OverFeat by 83%, a significant reduction in memory requirements of DNNs.

### vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design

- Computer Science
- 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
- 2016

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction…

### A Survey on Methods and Theories of Quantized Neural Networks

- Computer Science
- ArXiv
- 2018

A thorough review of different aspects of quantized neural networks is given; quantization is recognized as one of the most effective approaches to satisfy the extreme memory requirements that deep neural network models demand.

### On the quantization of recurrent neural networks

- Computer Science
- ArXiv
- 2021

This work presents an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies, which themselves are the foundation of many production ML systems.

### 8-Bit Approximations for Parallelism in Deep Learning

- Computer Science
- ICLR
- 2016

8-bit approximation is an efficient method to parallelize convolutional networks on very large systems of GPUs and achieves state-of-the-art speedups for model parallelism.

### Transfer Learning with Binary Neural Networks

- Computer Science
- NIPS 2017
- 2017

It is shown that a single binary neural network trained on the ImageNet dataset can be used as a feature extractor for other datasets, and a transfer-learning-based architecture built on it is proposed.

### DaDianNao: A Neural Network Supercomputer

- Computer Science
- IEEE Transactions on Computers
- 2017

A custom multi-chip machine-learning architecture is introduced, containing a combination of custom storage and computational units, with electrical and optical inter-chip interconnects considered separately. It is shown that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 656.63× over a GPU and reduce the energy by 184.05× on average for a 64-chip system.

## 12 References

### Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

- Computer Science
- ISCA
- 2010

This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

### Neural Network Implementation Using CUDA and OpenMP

- Computer Science
- 2008 Digital Image Computing: Techniques and Applications
- 2008

This paper proposes a quicker and more efficient implementation of neural networks on both GPU and multi-core CPU, and uses CUDA (Compute Unified Device Architecture), which can be easily programmed due to its simple C-like style, instead of GPU to solve the first problem.

### Large-scale deep unsupervised learning using graphics processors

- Computer Science
- ICML '09
- 2009

It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.

### Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition

- Computer Science
- INTERSPEECH
- 2012

This paper reports results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously, and outperforms the best Gaussian Mixture Model Hidden Markov Model baseline.

### Faster matrix-vector multiplication on GeForce 8800GTX

- Computer Science
- 2008 IEEE International Symposium on Parallel and Distributed Processing
- 2008

The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.

### CUDAMat: a CUDA-based matrix class for Python

- Computer Science
- 2009

The feature set of CUDAMat is biased towards providing functionality useful for implementing standard machine learning algorithms; however, it is general enough to be useful in other fields.

### Use of Gaussian selection in large vocabulary continuous speech recognition using HMMS

- Computer Science
- Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
- 1996

This paper investigates the use of Gaussian Selection to reduce the state likelihood computation in HMM-based systems and investigates the trade-offs necessary between achieving good state likelihoods and low computation.

### The bucket box intersection (BBI) algorithm for fast approximative evaluation of diagonal mixture Gaussians

- Computer Science
- 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
- 1996

This paper presents an alternative approach to approximate mixture Gaussians with diagonal covariance matrices, based on a binary feature space partitioning tree, which achieves a speedup of 2-5 in the computation of HMM emission probabilities, without affecting the accuracy of the system.