Eliminating Dark Bandwidth: A Data-Centric View of Scalable, Efficient Performance, Post-Moore
Jonathan C. Beard and Joshua Randall. ISC Workshops.
Most computing research has focused on the computing technologies themselves rather than on how full systems make use of them (e.g., the memory fabric, interconnect, software, and compute elements combined). Technologists have largely failed to look at the compute system as a whole, instead optimizing subsystems mostly in isolation. The result, for example, is that systems are built where applications can only ask for a fixed multiple of data (e.g., 64 bytes from DRAM), even if what is required is far…
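The fixed-multiple fetch problem can be quantified with a short sketch (the function name and parameters here are illustrative, not from the paper): it measures what fraction of each 64-byte line a strided stream of 8-byte elements actually uses.

```python
# Illustrative sketch: fraction of each fetched 64-byte cache line that a
# strided access pattern of 8-byte elements actually consumes. The rest of
# each line is "dark bandwidth" moved but never used.
LINE_BYTES = 64
ELEM_BYTES = 8

def line_utilization(stride_elems: int, n_elems: int = 1 << 16) -> float:
    """Simulate n_elems strided 8-byte reads and report used/fetched bytes."""
    touched_lines = {(i * stride_elems * ELEM_BYTES) // LINE_BYTES
                     for i in range(n_elems)}
    return (n_elems * ELEM_BYTES) / (len(touched_lines) * LINE_BYTES)

for stride in (1, 2, 4, 8):
    # stride 1 -> 1.0 (fully dense), stride 8 -> 0.125 (one element per line)
    print(stride, line_utilization(stride))
```

At a stride of eight elements, 87.5% of the fetched bytes are never touched, which is exactly the waste the paper's data-centric view targets.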
PLANAR: a programmable accelerator for near-memory data rearrangement
PLANAR, a programmable near-memory accelerator that rearranges sparse data into dense form, is presented; the design scales well with multi-core systems, hides operation latency by performing non-blocking, fine-grain data rearrangements, and eases programmability by supporting virtual memory and conventional memory-allocation mechanisms.
POSTER: SPiDRE: Accelerating Sparse Memory Access Patterns
This work explores the Sparse Data Rearrange Engine (SPiDRE), a novel hardware approach to accelerate near-memory data reorganization for sparse and irregular memory access patterns in data analytics applications.
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
This work introduces a methodology for identifying and characterizing data movement bottlenecks in memory-bound workloads, together with DAMOV, a benchmark suite distilled from a large application corpus, and uses both to evaluate when mitigation techniques such as near-data processing are beneficial.
The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload
This work presents the non-uniform compute device (NUCD) system architecture as a novel lightweight and generic accelerator offload mechanism that is tightly-coupled with a general-purpose processor core to enable a low-latency out-of-order task offload to heterogeneous devices.
The sparse data reduction engine: chopping sparse data one byte at a time
This paper presents a general solution for a programmable data rearrangement/reduction engine near-memory to deliver bulk byte-addressable data access and describes a programmer interface that enables all combinations of rearrangements.
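The core rearrangement such engines perform can be sketched in software (function names are my own; the real engine does this in hardware near memory): gather scattered elements into a dense buffer so the host consumes only the bytes it asked for, then scatter results back.

```python
# Software sketch of gather/scatter data rearrangement. A near-memory engine
# would perform these index walks at the memory side, returning only the
# packed, byte-addressable payload to the host.
def gather(src: list, indices: list) -> list:
    """Pack src[i] for each i in indices into a dense buffer."""
    return [src[i] for i in indices]

def scatter(dst: list, indices: list, values: list) -> None:
    """Write the dense values back to their sparse home locations."""
    for i, v in zip(indices, values):
        dst[i] = v

data = list(range(100))
idx = [3, 17, 42, 99]
dense = gather(data, idx)                 # dense == [3, 17, 42, 99]
scatter(data, idx, [v * 2 for v in dense])  # data[3] is now 6, etc.
```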
Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data
A new way to infer both spatial and temporal locality using reuse-distance analysis is presented, accomplished by performing reuse-distance analysis at different data-block granularities: specifically, 64 B, 4 KiB, and 2 MiB.
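The multi-granularity idea can be sketched as follows (a minimal illustration, not the paper's implementation; addresses are byte addresses and distance counts distinct blocks touched between reuses):

```python
# Minimal reuse-distance sketch: the same trace analyzed at two block
# granularities. Small distances at a coarse granularity but not a fine one
# indicate spatial rather than temporal locality.
from collections import OrderedDict

def reuse_distances(addrs, block_bytes):
    """Return the stack (reuse) distance of each access; None means cold."""
    stack = OrderedDict()              # most-recently-used block is last
    out = []
    for a in addrs:
        blk = a // block_bytes
        if blk in stack:
            keys = list(stack)
            # distinct blocks touched since this block's last use
            out.append(len(keys) - 1 - keys.index(blk))
            del stack[blk]
        else:
            out.append(None)
        stack[blk] = True
    return out

trace = [0, 64, 128, 0]                    # three lines, then reuse the first
print(reuse_distances(trace, 64))          # [None, None, None, 2]
print(reuse_distances(trace, 4096))        # same page: [None, 0, 0, 0]
```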
Observations and opportunities in architecting shared virtual memory for heterogeneous systems
This work analyzes, using real-system measurements, shared virtual memory across the CPU and an integrated GPU, and presents a detailed measurement study of a commercially available integrated APU that illustrates these effects and motivates future research opportunities.
Memory Systems: Cache, DRAM, Disk
Is your memory hierarchy stopping your microprocessor from performing at the high level it should? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem.
Improving cache utilisation
This thesis demonstrates that cache utilisation is relatively poor over a wide range of benchmarks and cache configurations, presents a variety of utilisation predictors, mostly based upon the mature field of branch prediction, and compares them against previously proposed predictors.
Energy-efficient address translation
This work proposes Lite, a mechanism that monitors the performance and utility of L1 TLBs, and adaptively changes their sizes with way-disabling, and proposes RMMLite, a method that targets the recently proposed Redundant Memory Mappings address-translation mechanism.
BigDataBench: A big data benchmark suite from internet services
Lei Wang, Jianfeng Zhan, Bizhu Qiu. 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.
The big data benchmark suite BigDataBench not only covers broad application scenarios but also includes diverse and representative data sets; the authors comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs.
Limits on fundamental limits to computation
Fundamental limits to computation in the areas of manufacturing, energy, physical space, design and verification effort, and algorithms are reviewed, to outline what is achievable in principle and in practice.
In-Memory Data Rearrangement for Irregular, Data-Intensive Computing
An emulation on a field-programmable gate array shows how a data rearrangement engine could improve performance, memory bandwidth utilisation, and energy consumption on three representative benchmarks.
Toward a New Metric for Ranking High Performance Computing Systems
A new High Performance Conjugate Gradient (HPCG) benchmark is described, composed of computations and data access patterns more commonly found in real applications, which strives for a better correlation to real scientific application performance.
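The kernel HPCG is built around can be sketched in a few lines (a toy pure-Python conjugate gradient for a small symmetric positive-definite system, not the benchmark itself):

```python
# Toy conjugate-gradient solve for A x = b, A symmetric positive definite.
# HPCG stresses the sparse, memory-bound version of exactly this iteration.
def cg(A, b, iters=50, tol=1e-12):
    n = len(b)
    x = [0.0] * n
    r = b[:]                           # residual r = b - A x with x = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)                           # converges to x = [1/11, 7/11]
```

Unlike LINPACK's dense factorizations, this iteration is dominated by sparse matrix-vector products and dot products, which is why HPCG correlates better with memory-bound scientific codes.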
SPEC CPU2006 benchmark descriptions
On August 24, 2006, the Standard Performance Evaluation Corporation (SPEC) announced CPU2006 [2], which replaces CPU2000. The SPEC CPU benchmarks are widely used in both industry and academia [3].
LINPACK Users' Guide
The guide covers general matrices, band matrices, positive definite matrices, positive definite band matrices, symmetric indefinite matrices, triangular matrices, and tridiagonal matrices, and studies factorizations from the Cholesky decomposition and the QR decomposition up to and including the singular value decomposition.