Publications
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR: We present numerical evidence that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
TLDR: An analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels, which makes them suitable for today's multi-core CPUs and GPUs.
Efficient sparse matrix-vector multiplication on x86-based many-core processors
TLDR: We identify and address several bottlenecks which may limit performance even before memory bandwidth: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load imbalance due to non-uniform matrix structures.
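The kernel in question is easy to sketch. A minimal pure-Python CSR sparse matrix-vector multiply (names are illustrative, not the paper's code); the indirect access to x through the column-index array is the irregular-access bottleneck the paper targets:

```python
# y = A @ x with A stored in CSR (compressed sparse row) format.

def csr_spmv(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Gather through col_idx is the irregular memory access pattern;
        # short rows also leave SIMD lanes idle on vector hardware.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], CSR stores values=[1, 2, 3], col_idx=[0, 2, 1], row_ptr=[0, 2, 3].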
qHiPSTER: The Quantum High Performance Software Testing Environment
TLDR: We present qHiPSTER, the Quantum High Performance Software Testing Environment.
Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications
TLDR: This paper examines the growing need for a general-purpose "analytics engine" that can enable next-generation processing platforms to effectively model events, objects, and concepts from end-user input and accessible datasets, along with an ability to iteratively refine the model in real time.
Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor
TLDR: Dense linear algebra has traditionally been used to evaluate the performance and efficiency of new architectures.
The Architectural Implications of Facebook's DNN-Based Personalized Recommendation
TLDR: The widespread application of deep learning has changed the landscape of computation in data centers.
Anatomy of High-Performance Many-Threaded Matrix Multiplication
TLDR: BLIS is a new framework for rapid instantiation of the BLAS.
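The layered, blocked loop structure that frameworks like BLIS instantiate can be sketched as follows (block size and loop order are illustrative; this is not the BLIS micro-kernel):

```python
# Cache-blocked matrix multiply: C = A @ B, with A (n x k) and B (k x m)
# as lists of lists. Tiling keeps working sets of C, A, and B resident
# in cache while the innermost loops update one bs x bs block of C.

def matmul_blocked(A, B, bs=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # "Micro-kernel": accumulate one block of C.
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + bs, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

In a real instantiation the innermost block update is a hand-tuned, register-blocked kernel; the outer loops only choose which tiles it runs on.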
Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver
TLDR: The last decade has seen rapid growth of single-chip multiprocessors (CMPs), which have leveraged Moore's law to deliver high concurrency via increases in the number of cores and vector width.
High Performance Parallel Stochastic Gradient Descent in Shared Memory
TLDR: In this paper, we explore several modern parallelization methods of SGD on a shared-memory system, in the context of sparse and convex optimization problems, and show that their parallel efficiency is severely limited by inter-core communication.
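A minimal serial SGD baseline for a convex least-squares problem (a sketch; function and parameter names are my assumptions, not the paper's). The parallel variants the paper studies distribute exactly these updates across cores, where contention on the shared weight vector becomes the limiting inter-core communication:

```python
import random

# Serial SGD on the convex objective 0.5 * (w . x - y)^2 per sample.

def sgd_least_squares(samples, lr=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        x, y = samples[rng.randrange(len(samples))]
        pred = sum(wi * xi for wi, xi in zip(w, x))
        g = pred - y  # gradient of the squared loss w.r.t. pred
        # In shared-memory parallel SGD, multiple cores apply this
        # update concurrently; traffic on w limits scaling.
        for j in range(dim):
            w[j] -= lr * g * x[j]
    return w
```

On consistent (interpolable) data the per-sample gradient vanishes at the optimum, so this converges to the exact solution.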