SpZip: Architectural Support for Effective Data Compression In Irregular Applications

@inproceedings{yang2021spzip,
  title={SpZip: Architectural Support for Effective Data Compression In Irregular Applications},
  author={Yifan Yang and Joel S. Emer and Daniel S{\'a}nchez},
  booktitle={2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)},
  year={2021}
}
Irregular applications, such as graph analytics and sparse linear algebra, exhibit frequent indirect, data-dependent accesses to single or short sequences of elements that cause high main memory traffic and limit performance. Data compression is a promising way to accelerate irregular applications by reducing memory traffic. However, software compression adds substantial overheads, and prior hardware compression techniques work poorly on the complex access patterns of irregular applications. We…
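The indirect, data-dependent accesses the abstract describes can be illustrated with a toy CSR sparse matrix-vector product (hypothetical data, not from the paper): the gather `x[col[j]]` depends on the matrix's nonzero structure, so it defeats caches and stride prefetchers.

```python
def spmv_csr(row_ptr, col, val, x):
    """y = A @ x for a CSR matrix A; the x[col[j]] gather is irregular."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # indirect, data-dependent load: the address of x[col[j]]
            # is only known after col[j] itself has been fetched
            y[i] += val[j] * x[col[j]]
    return y

# 2x2 example matrix [[1, 2], [0, 3]] in CSR form
row_ptr = [0, 2, 3]
col = [0, 1, 1]
val = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col, val, [1.0, 1.0]))  # [3.0, 3.0]
```

The `col` array here is tiny, but in real graph and sparse workloads it is far larger than the cache, which is what makes the gather memory-bound.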


Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design
Prodigy is presented, a low-cost hardware-software co-design solution that uses intelligent prefetching to improve the memory latency of several important irregular workloads; its performance is compared against a non-prefetching baseline as well as state-of-the-art prefetchers.
Linearly compressed pages: A low-complexity, low-latency main memory compression framework
It is shown that any compression algorithm can be adapted to fit the requirements of LCP, and two previously proposed compression algorithms are adapted to it: Frequent Pattern Compression and Base-Delta-Immediate Compression.
Optimizing indirect memory references with milk
Modern applications such as graph and data analytics, when operating on real-world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM…
Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures
This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), which increases the effective memory bandwidth and significantly reduces memory bandwidth requirements.
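The bit-plane idea can be sketched as follows (a simplified toy, not BPC's actual delta-bit-plane/XOR pipeline): take deltas of neighboring words, then transpose the bits so that bit b of every delta lands in plane b; for slowly varying data, most high planes are all zeros and compress trivially.

```python
def bit_planes(words, width=8):
    """Toy bit-plane transform: deltas, then a bitwise transpose."""
    mask = (1 << width) - 1
    # deltas between consecutive words expose value locality
    deltas = [words[0]] + [(words[i] - words[i - 1]) & mask
                           for i in range(1, len(words))]
    # plane b collects bit b of every delta, packed into one word
    planes = []
    for b in range(width):
        plane = 0
        for i, d in enumerate(deltas):
            plane |= ((d >> b) & 1) << i
        planes.append(plane)
    return planes

# e.g. bit_planes([5, 6, 7, 8]) -> [15, 0, 1, 0, 0, 0, 0, 0]:
# the deltas are [5, 1, 1, 1], so planes 3..7 are all zero
```

Runs of all-zero planes like planes 3..7 above are exactly what a downstream run-length or frequent-pattern encoder exploits.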
An Event-Triggered Programmable Prefetcher for Irregular Workloads
An event-triggered programmable prefetcher is proposed, combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques that automatically generate events from annotated source code.
Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism
Pipette is presented, a technique that enables cheap pipeline parallelism within each core using architecturally visible queues; it avoids load imbalance and achieves high core IPC by time-multiplexing stages on the same core.
IMP: Indirect memory prefetcher
This work proposes an efficient hardware indirect memory prefetcher (IMP) to capture indirect access patterns and hide latency, together with a partial-cacheline accessing mechanism for these prefetches that reduces the network and DRAM bandwidth pressure caused by the lack of spatial locality.
SC2: A statistical compression cache scheme
This paper presents, for the first time, a detailed design-space exploration of caches that utilize statistical compression; it shows that more aggressive approaches like Huffman coding, previously neglected due to their high (de)compression overhead, are suitable techniques for caches and memory.
When is Graph Reordering an Optimization? Studying the Effect of Lightweight Graph Reordering Across Applications and Input Graphs
This work identifies lightweight reordering techniques that improve performance even after accounting for the overhead of reordering, and addresses a major impediment to their general adoption, input-dependent speedups, by linking the speedup from lightweight reordering to structural properties of the input graph.
QEI: Query Acceleration Can be Generic and Efficient in the Cloud
This paper proposes QEI, a generic, integrated, and efficient acceleration solution for various data-structure queries; it allows multiple query operations to execute in parallel to maximize throughput, and proposes a novel way to integrate the accelerator into the CPU that balances performance, latency, and hardware cost.