• Corpus ID: 15196840

Improving the speed of neural networks on CPUs

@inproceedings{Vanhoucke2011ImprovingTS,
  title={Improving the speed of neural networks on CPUs},
  author={Vincent Vanhoucke and Andrew W. Senior and Mark Z. Mao},
  year={2011}
}
Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. [...] We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid…
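
The fixed-point idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the symmetric per-vector weight scale, and the assumption that activations lie in [0, 1] are all hypothetical choices made for illustration. The sketch quantizes weights to signed 8-bit integers and activations to unsigned 8-bit integers, accumulates the dot product in integer arithmetic, and rescales once at the end. On a CPU, the inner loop is the part that fixed-point SIMD instructions would process many lanes at a time.

```python
def quantize_weights(weights):
    """Map float weights to signed 8-bit integers with one shared scale.

    Hypothetical scheme: the largest magnitude maps to 127.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = 127.0 / max_abs
    quantized = [max(-128, min(127, round(w * scale))) for w in weights]
    return quantized, scale


def quantize_activations(activations):
    """Map activations, assumed to lie in [0, 1], to unsigned 8-bit integers."""
    scale = 255.0
    quantized = [max(0, min(255, round(a * scale))) for a in activations]
    return quantized, scale


def fixed_point_dot(q_activations, q_weights):
    """Integer multiply-accumulate; no floating-point work in the loop."""
    accumulator = 0
    for a, w in zip(q_activations, q_weights):
        accumulator += a * w
    return accumulator


# Illustrative values only, not data from the paper.
activations = [0.0, 0.25, 0.5, 1.0]
weights = [0.8, -0.4, 0.1, -1.0]

q_w, w_scale = quantize_weights(weights)
q_a, a_scale = quantize_activations(activations)

# A single rescale converts the integer accumulator back to a float.
approx = fixed_point_dot(q_a, q_w) / (a_scale * w_scale)
exact = sum(a * w for a, w in zip(activations, weights))
print(approx, exact)
```

The quantized result differs from the floating-point one only by a small rounding error, which is the trade-off that makes 8-bit arithmetic attractive when throughput matters more than the last digits of precision.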

Comparing deep learning performance on BigData by using CPUs and GPUs
TLDR
This paper uses a deep learning approach to process big data on a multi-core platform, and shows that GPUs increase system performance by up to 10 times, depending on the type of GPU.
Using software optimization techniques and exploiting hardware capabilities to speed-up BLSTM neural network on CPUs
TLDR
This work introduces and demonstrates the efficacy of many software optimization techniques that allow for neural networks to fully benefit from the capabilities of CPUs without compromising their accuracy, and evaluates the proposed optimization techniques using a Bidirectional Long Short-Term Memory neural network to solve an Optical Character Recognition (OCR) problem.
Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design
TLDR
The authors' virtualized DNN (vDNN) reduces the average memory usage of AlexNet by 61% and OverFeat by 83%, a significant reduction in memory requirements of DNNs.
vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction
A Survey on Methods and Theories of Quantized Neural Networks
TLDR
A thorough review of different aspects of quantized neural networks is given, recognized as one of the most effective approaches to satisfy the extreme memory requirements that deep neural network models demand.
A survey of neural network accelerators
TLDR
This review can serve as a reference for hardware researchers in the area of neural networks and recent related works, as well as the DianNao-family accelerators.
On the quantization of recurrent neural networks
TLDR
This work presents an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies, which themselves are the foundation of many production ML systems.
8-Bit Approximations for Parallelism in Deep Learning
TLDR
8-bit approximation is an efficient method to parallelize convolutional networks on very large systems of GPUs and achieves state-of-the-art speedups for model parallelism.
Transfer Learning with Binary Neural Networks
TLDR
It is shown that a single binary neural network trained on the ImageNet dataset can indeed be used as a feature extractor for other datasets, and a transfer-learning-based architecture is proposed.
DaDianNao: A Neural Network Supercomputer
TLDR
This paper introduces a custom multi-chip machine-learning architecture that combines custom storage and computational units, with electrical and optical inter-chip interconnects considered separately, and shows that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 656.63× over a GPU and reduce the energy by 184.05× on average for a 64-chip system.

References

SHOWING 1-10 OF 16 REFERENCES
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
TLDR
This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Neural Network Implementation Using CUDA and OpenMP
TLDR
This paper proposes a faster and more efficient implementation of neural networks on both GPUs and multi-core CPUs, and uses CUDA (Compute Unified Device Architecture), which is easy to program thanks to its simple C-like style, to solve the first problem.
Large-scale deep unsupervised learning using graphics processors
TLDR
It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.
GPU implementation of neural networks
Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition
TLDR
This paper reports results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously, and outperforms the best Gaussian Mixture Model Hidden Markov Model baseline.
Faster matrix-vector multiplication on GeForce 8800GTX
  • N. Fujimoto
  • Computer Science
    2008 IEEE International Symposium on Parallel and Distributed Processing
  • 2008
TLDR
The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.
CUDAMat: a CUDA-based matrix class for Python
TLDR
The feature set of CUDAMat is biased towards providing functionality useful for implementing standard machine learning algorithms; however, it is general enough to be useful in other fields.
Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs
  • K. Knill, M. Gales, S. Young
  • Computer Science
    Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
  • 1996
TLDR
This paper investigates the use of Gaussian Selection to reduce the state likelihood computation in HMM-based systems and investigates the trade-offs necessary between achieving good state likelihoods and low computation.
The bucket box intersection (BBI) algorithm for fast approximative evaluation of diagonal mixture Gaussians
  • J. Fritsch, I. Rogina
  • Computer Science
    1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
  • 1996
TLDR
This paper presents an alternative approach to approximate mixture Gaussians with diagonal covariance matrices, based on a binary feature space partitioning tree, which achieves a speedup of 2-5 in the computation of HMM emission probabilities, without affecting the accuracy of the system.