• Corpus ID: 227334386

A Modern Primer on Processing in Memory

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun
Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data… 

PIM-Enclave: Bringing Confidential Computation Inside Memory

A novel Processing-In-Memory (PIM) design is presented as a data-intensive workload accelerator for confidential computing; it provides side-channel-resistant secure computation offloading and runs data-intensive applications with negligible performance overhead compared to the baseline PIM model.

PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

PiDRAM, the first flexible end-to-end framework that enables system integration studies and evaluation of real, commodity DRAM-based Processing-using-Memory (PuM) techniques, is designed and developed; the work describes how to solve key integration challenges to make such techniques work effectively on a real-system prototype.

Accelerating Neural Network Inference With Processing-in-DRAM: From the Edge to the Cloud

The analysis reveals that PIM greatly benefits memory-bound NNs, and concludes that the ideal PIM architecture for a NN model depends on the model's distinct attributes as well as on the inherent architectural design choices of the PIM system.

Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture, and presents PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains, which are identified as memory-bound.

Casper: Accelerating Stencil Computations Using Near-Cache Processing

Casper is a near-cache accelerator consisting of specialized stencil computation units connected to the last-level cache (LLC) of a traditional CPU, based on two key ideas: avoiding the cost of moving rarely reused data throughout the cache hierarchy, and exploiting the regularity of the data accesses and the inherent parallelism of stencil computations to increase overall performance.
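To make the access regularity that Casper exploits concrete, the following is a minimal software sketch of a 5-point stencil sweep; the kernel shape and coefficients are assumptions for illustration only, not Casper's hardware design:

```python
# Illustrative 5-point 2D stencil: each interior point is updated from
# itself and its four neighbors. The access pattern is fully regular and
# every point's update is independent, which is what makes stencils
# amenable to specialized near-cache computation units.
def stencil_2d(grid):
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return out
```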

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

This work introduces a new methodology for identifying sources of data movement bottlenecks in applications, together with DAMOV, a benchmark suite for evaluating data movement mitigation mechanisms such as near-data processing.

Intelligent Architectures for Intelligent Computing Systems

  • O. Mutlu
  • 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE)
This invited special session talk describes three major shortcomings of modern architectures in terms of 1) dealing with data, 2) taking advantage of the vast amounts of data, and 3) exploiting different semantic properties of application data.

SIMDRAM: a framework for bit-serial SIMD processing using DRAM

This paper proposes SIMDRAM, a flexible general-purpose processing-using-DRAM framework that enables the efficient implementation of complex operations and provides a flexible mechanism to support the implementation of arbitrary user-defined operations.
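Processing-using-DRAM frameworks in this line of work build complex operations out of simple in-array primitives. The sketch below is a simplified software model, under the assumption that the substrate provides a bitwise majority (MAJ) of three rows plus NOT, which together are functionally complete:

```python
# Software model of bit-serial processing-using-DRAM primitives
# (an illustrative simplification): DRAM rows are modeled as bit-vectors,
# and simultaneous multi-row activation yields a bitwise majority.
def maj(a, b, c):
    # Bitwise majority of three equal-length bit-vectors of 0/1 values.
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]

def bit_not(a):
    return [1 - x for x in a]

# MAJ plus NOT is functionally complete; for example:
def bit_and(a, b):
    # AND(a, b) = MAJ(a, b, all-zeros row)
    return maj(a, b, [0] * len(a))

def bit_or(a, b):
    # OR(a, b) = MAJ(a, b, all-ones row)
    return maj(a, b, [1] * len(a))
```

More complex operations (addition, multiplication) are then composed bit-serially from such primitives across many rows.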

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

This work identifies the biggest system throughput bottleneck as the mismatch between the massive computation resources of one monolithic accelerator and the application's many matrix-multiply (MM) layers of small sizes, and proposes the CHARM framework to compose multiple diverse MM accelerator architectures that work concurrently on different layers within one application.

Fundamentally Understanding and Solving RowHammer

Two major directions are argued for to amplify research and development efforts: building a much deeper understanding of the RowHammer problem and its many dimensions, in both cutting-edge DRAM chips and computing systems deployed in the field; and designing extremely efficient and fully-secure solutions via system-memory cooperation.

D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput

D-RaNGe is a methodology for extracting true random numbers from commodity DRAM devices with high throughput and low latency by deliberately violating the read access timing parameters, and is evaluated using the commonly-used NIST statistical test suite for randomness.
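The core idea can be sketched in software as follows. This is a hypothetical simplification: the non-deterministic cell failure caused by the timing violation is simulated with a coin flip, whereas the real methodology first characterizes the chip to find cells whose failures are truly random:

```python
# Illustrative sketch of the D-RaNGe idea: reading DRAM cells with a
# deliberately reduced activation latency makes certain cells fail
# non-deterministically, and those failures are harvested as random bits.
import random

def read_with_reduced_latency(cell_is_flaky):
    # A reliable cell returns its stored value; a flaky cell returns a
    # non-deterministic value due to the timing violation (simulated here).
    if cell_is_flaky:
        return random.randint(0, 1)
    return 1  # stored value

def harvest_random_bits(flaky_cells, n_bits):
    # Repeatedly read the previously identified flaky cells to collect bits.
    bits = []
    while len(bits) < n_bits:
        for cell in flaky_cells:
            bits.append(read_with_reduced_latency(cell))
            if len(bits) == n_bits:
                break
    return bits
```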

The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

The DRAM latency PUF is introduced, a new class of fast, reliable DRAM PUFs that satisfy runtime-accessible PUF requirements and are quickly generated irrespective of operating temperature, using a real system with no additional hardware modifications.

Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM

A new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), is proposed, whose goal is to enable fast and efficient data movement across a large range of memory at low cost; the combined benefit of its applications is higher than the benefit of each alone, on a variety of workloads and system configurations.

PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture

A new PIM architecture is proposed that does not change the existing sequential programming models and automatically decides whether to execute PIM operations in memory or in processors depending on the locality of data, thereby combining the best parts of conventional and PIM architectures by adapting to the data locality of applications.
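The locality-aware dispatch idea can be sketched as follows; the dictionary-based cache model and function names are assumptions for illustration, not the paper's mechanism:

```python
# Hypothetical sketch of locality-aware PIM dispatch: run an operation on
# the host when its operand is likely cached (hot), and ship it to
# memory-side logic when the operand is cold, avoiding both cache
# pollution and unnecessary data movement over the memory bus.
def pim_dispatch(addr, op, cache, memory):
    if addr in cache:
        # Data is hot on the host: executing there avoids evicting it
        # and reuses the cached copy.
        return op(cache[addr])
    # Data is cold: execute near memory instead of moving it to the CPU.
    return op(memory[addr])
```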

ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs

This work is the first to demonstrate in-memory computation with off-the-shelf, unmodified, commercial DRAM by violating the nominal timing specification: activating multiple rows in rapid succession leaves multiple rows open simultaneously, thereby enabling bit-line charge sharing.

MAGIC—Memristor-Aided Logic

In this brief, a memristor-only logic family, i.e., memristor-aided logic (MAGIC), is presented; in each MAGIC logic gate, memristors serve as inputs with previously stored data, and an additional memristor serves as the output.
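A purely logical model of the basic gate can be sketched as follows (an assumed simplification of the circuit behavior; the actual work concerns memristor circuit design, not software):

```python
# Logical model of a MAGIC-style NOR gate: input memristors hold
# previously stored bits, and the output memristor ends up holding the
# NOR of the inputs. NOR is functionally complete, so any Boolean
# function can in principle be mapped to cascaded NOR gates.
def magic_nor(*inputs):
    return 0 if any(inputs) else 1

def magic_not(a):
    # NOT is a one-input NOR.
    return magic_nor(a)

def magic_or(a, b):
    # OR is NOR followed by NOT.
    return magic_not(magic_nor(a, b))
```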

Memristor-based IMPLY logic design procedure

The design and behavior of a memristor-based logic gate, the IMPLY gate, are presented, and design issues such as the tradeoff between speed (fast write times) and correct logic behavior are described as part of an overall design methodology.

Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies

The IMPLY logic gate, a memristor-based logic circuit, is described and a methodology for designing this logic family is proposed, based on a general design flow suitable for all deterministic memristive logic families.
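The logical behavior of the IMPLY primitive described in the two entries above can be sketched as follows (a truth-table-level model only; the cited works concern the underlying memristor circuits):

```python
# Material implication: given input memristor p and output memristor q,
# the IMPLY operation leaves q' = (NOT p) OR q stored in q. Together
# with FALSE (resetting a memristor to 0), IMPLY is functionally
# complete; NAND, for example, takes two IMPLY steps.
def imply(p, q):
    return (1 - p) | q

def false_op():
    # Reset a memristor to logic 0.
    return 0

def nand(p, q):
    # NAND(p, q) = p IMPLY (q IMPLY 0), since q IMPLY 0 = NOT q
    # and p IMPLY (NOT q) = (NOT p) OR (NOT q).
    s = imply(q, false_op())
    return imply(p, s)
```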

Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost

  • TACO, 2016

Exploiting Near-Data Processing to Accelerate Time Series Analysis

A time series is a chronologically ordered set of samples of a real-valued variable that can contain millions of observations. Time series analysis is used to analyze information in a wide variety of domains.
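As an illustration of the kind of sliding-window computation common in time series analysis (an assumed example; the cited work concerns near-data acceleration of such analyses), the sketch below computes the Euclidean distance between a query subsequence and every window of a series:

```python
# For each window of the series with the same length as the query,
# compute the Euclidean distance to the query. This streaming,
# memory-intensive access pattern is typical of time series analysis
# and is the kind of workload near-data processing targets.
import math

def sliding_distances(series, query):
    m = len(query)
    return [
        math.sqrt(sum((series[i + j] - query[j]) ** 2 for j in range(m)))
        for i in range(len(series) - m + 1)
    ]
```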