Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

@inproceedings{Lee2010DebunkingT1,
  title={Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU},
  author={Victor W. Lee and Changkyu Kim and Jatin Chhugani and Michael E. Deisher and Daehyun Kim and Anthony D. Nguyen and Nadathur Satish and Mikhail Smelyanskiy and Srinivas Chennupaty and Per Hammarlund and Ronak Singhal and Pradeep K. Dubey},
  booktitle={Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA)},
  year={2010}
}
  • Published 19 June 2010
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver… 

Citations of this paper

Performance Analysis of Application Kernels in Multi / Many-Core Architectures
TLDR
This work performs a performance comparison of important application kernels such as image convolution, histogram, and bilateral filtering on multi-core CPUs and many-core NVIDIA GPUs, and additionally compares the GPU-enabled Many-Task Computing (GeMTC) research framework.
CPU and/or GPU: Revisiting the GPU Vs. CPU Myth
TLDR
This work suggests that hybrid computing can offer tremendous advantages not only on research-scale platforms but also on more realistic-scale systems, delivering significant performance gains and resource efficiency to the large-scale user community.
Where is the data? Why you cannot debate CPU vs. GPU performance without the answer
  • Chris Gregg, Kim M. Hazelwood
  • Computer Science
    IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
  • 2011
TLDR
A taxonomy for future CPU/GPU comparisons is suggested, and it is argued that this is not only germane for reporting performance, but is important to heterogeneous scheduling research in general.
A Survey of CPU-GPU Heterogeneous Computing Techniques
TLDR
This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Characterization and Exploitation of GPU Memory Systems
TLDR
This thesis intends to show the importance of memory optimizations for GPU systems, and addresses problems of data transfer and global atomic memory contention, and provides a theoretical model which can be used to correctly predict the comparative performance of memory movement techniques for a given data-intensive application and system.
Improving performance of data-parallel applications on CPU-GPU heterogeneous systems
TLDR
This thesis explores the performance and energy efficiency of CUDA-enabled GPUs and multi-core SIMD CPUs using numerical simulations of cardiac action potential propagation, a valuable tool for understanding the mechanisms that promote arrhythmias which may degenerate into spiral wave propagation.
Paragon: collaborative speculative loop execution on GPU and CPU
TLDR
Paragon is a collaborative static/dynamic compiler platform to speculatively run possibly-data-parallel pieces of sequential applications on GPUs that are present in everyday computing devices such as laptops and mobile systems.
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
TLDR
This paper empirically characterizes and analyzes the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die, and evaluates its performance via a set of micro-benchmarks.
Performance Analysis of a Hybrid MPI / CUDA Implementation of the NAS-LU Benchmark
TLDR
An analysis of a port of the NAS LU benchmark to NVIDIA's Compute Unified Device Architecture (CUDA), the most stable GPU programming model currently available, is presented, and the runtime performance of LU is projected onto the hybrid architectures expected to become commonplace in future high-end HPC solutions.
Suitability Analysis of GPUs and CPUs for Graph Algorithms
  • Computer Science
  • 2016
TLDR
The results of this thesis lead to the conclusion that the higher energy efficiency and, depending on the point of view, cost efficiency of the GPUs do not outweigh the lower programming effort for the implementation of graph algorithms on CPUs.

References

Showing 1-10 of 74 references
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
TLDR
A simple analytical model is proposed that estimates the execution time of massively parallel programs by considering the number of running threads and memory bandwidth, and estimates the cost of memory requests, thereby estimating the overall execution time of a program.
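The idea behind such a model can be sketched in a few lines: a kernel's time is bounded by whichever dominates, compute or memory traffic, with memory latency hidden by concurrent threads up to a bandwidth cap. This is a toy illustration in that spirit, not the paper's actual model; all parameter names are hypothetical.

```python
# Toy analytical cost model: one streaming multiprocessor's cycles are the
# max of its compute cycles and its effective memory cycles, where memory
# latency is overlapped across in-flight requests. Illustrative only.

def estimate_time(compute_cycles, mem_requests, mem_latency,
                  threads, bandwidth_limit):
    """Estimate kernel cycles under a simple overlap assumption."""
    # Memory-level parallelism: concurrent threads hide latency until the
    # bandwidth limit caps how many requests can be in flight at once.
    mlp = min(threads, bandwidth_limit)
    memory_cycles = mem_requests * mem_latency / mlp
    # The kernel runs at the pace of its slower component.
    return max(compute_cycles, memory_cycles)

print(estimate_time(compute_cycles=10_000, mem_requests=500,
                    mem_latency=400, threads=32, bandwidth_limit=8))
# -> 25000.0 (memory-bound: 500 * 400 / 8 exceeds the 10,000 compute cycles)
```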
GPUTeraSort: high performance graphics co-processor sorting for large database management
TLDR
Overall, the results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
TLDR
This paper presents a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures, and proposes novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results.
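For readers unfamiliar with the non-comparison-based class of algorithm evaluated here, a minimal least-significant-digit radix sort looks like this. It is only a scalar sketch: the paper's implementations are SIMD- and bandwidth-aware, which this version is not.

```python
# Minimal LSD radix sort over 8-bit digits. Each pass is a stable
# scatter/gather by one digit, so after all passes the keys are sorted.

def radix_sort(keys, key_bits=32, digit_bits=8):
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        buckets = [[] for _ in range(1 << digit_bits)]
        for k in keys:                           # scatter by current digit
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]   # stable gather
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# -> [2, 24, 45, 66, 75, 90, 170, 802]
```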
Benchmarking GPUs to tune dense linear algebra
  • V. Volkov, J. Demmel
  • Computer Science
    2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2008
TLDR
It is argued that modern GPUs should be viewed as multithreaded multicore vector units and exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU.
Efficient computation of sum-products on GPUs through software-managed cache
TLDR
A GPU-based MPF solver achieves up to a 2700-fold speedup on random data and 270-fold on real-life genetic analysis datasets on an NVIDIA GeForce 8800GTX GPU over the optimized CPU version on an Intel 2.4GHz Core 2 with a 4MB L2 cache.
High performance discrete Fourier transforms on graphics processors
TLDR
The algorithms are implemented using the NVIDIA CUDA API and their performance is compared with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU.
FAST: fast architecture sensitive tree search on modern CPUs and GPUs
TLDR
FAST is an extremely fast architecture sensitive layout of the index tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware, and achieves a 6X performance improvement over uncompressed index search for large keys on CPUs.
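The core trick behind an architecture-sensitive tree layout can be illustrated with a simplified version: storing a sorted array's implicit search tree in breadth-first (Eytzinger) order keeps each level's nodes contiguous in memory, the kind of cache-line-friendly blocking FAST generalizes to page, cache-line, and SIMD granularity. This is a hypothetical simplification, not the paper's layout.

```python
# Breadth-first (Eytzinger) layout of a sorted key set, plus search over it.
# Children of node i live at 2i+1 and 2i+2, so descent touches a compact,
# predictable memory pattern rather than pointer-chased nodes.

def eytzinger(sorted_keys):
    """Rearrange sorted keys into BFS order of the implicit binary tree."""
    out = [None] * len(sorted_keys)
    it = iter(sorted_keys)
    def fill(i):
        if i < len(out):
            fill(2 * i + 1)          # in-order walk over BFS slots
            out[i] = next(it)        # assigns sorted keys in order
            fill(2 * i + 2)
    fill(0)
    return out

def search(tree, key):
    """Return True iff key is present, descending the BFS-laid-out tree."""
    i = 0
    while i < len(tree):
        if tree[i] == key:
            return True
        i = 2 * i + 1 if key < tree[i] else 2 * i + 2
    return False

laid_out = eytzinger(list(range(1, 16)))   # 15 keys -> a complete tree
print(search(laid_out, 7), search(laid_out, 99))
# -> True False
```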
LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs
TLDR
It is argued that modern GPUs should be viewed as multithreaded multicore vector units and exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU.
Designing efficient sorting algorithms for manycore GPUs
TLDR
The design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA, are described, which are the fastest GPU sort and the fastest comparison-based sort reported in the literature.
Efficient implementation of sorting on multi-core SIMD CPU architecture
TLDR
An efficient implementation and detailed analysis of MergeSort on current CPU architectures, and performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count.
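The merge kernel at the heart of such a MergeSort can be sketched in plain Python. The paper's implementation vectorizes this inner loop with SIMD merging networks and blocks it for the cache hierarchy and core count; this scalar version omits all of that and only shows the algorithmic skeleton.

```python
# Scalar merge of two sorted runs, and the recursive sort built on it.

def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:                 # stable: ties taken from `a` first
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]           # drain whichever run remains

def merge_sort(keys):
    if len(keys) <= 1:
        return keys
    mid = len(keys) // 2
    return merge(merge_sort(keys[:mid]), merge_sort(keys[mid:]))

print(merge_sort([5, 1, 4, 2, 8, 3]))    # -> [1, 2, 3, 4, 5, 8]
```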