In-memory Data Reorganization Performed in Parallel with Host Processor Memory Accesses and Provides Mechanisms to Handle Issues Such

Abstract

Data reorganization operations (for example, matrix transpose, pack and unpack, and shuffle) are a critical building block in scientific computing applications such as signal processing, molecular dynamics simulations, and linear algebra. High-performance libraries, such as the Intel Math Kernel Library (https://software.intel.com/en-us/intel-mkl), generally provide optimized implementations of these reorganization operations. Data reorganization is also often employed as an optimization: previous work demonstrated that reorganizing the data layout into a more efficient format improves application performance. Physical data reorganization in memory is free of data dependencies and preserves program semantics, so it provides software-transparent performance improvement. However, reorganization operations incur significant energy and latency overheads in conventional systems, owing to limited data reuse, inefficient access patterns, and round-trip data movement between the CPU and DRAM.

Near-data processing (NDP) can be an effective solution for reducing this data movement between the processor and memory. By integrating processing capability into the memory, NDP enables localized computation where the data reside, cutting round-trip data movement and its energy and latency overheads. Emerging 3D die stacking with through-silicon via (TSV) technology gives new life to NDP concepts that were proposed decades ago. 3D-stacked DRAM exploits TSV-based stacking and rearchitects the DRAM banks to achieve better timing and energy efficiency at a smaller area footprint. It substantially increases internal bandwidth and reduces internal access latency by eliminating pin-count limitations. More interestingly, by integrating different DRAM and custom-logic process technologies, it enables high-performance computing capability near memory.
However, the stack’s power and thermal constraints limit this computing capability. Complete processing units integrated in the logic layer fall short in sustaining the …

6 Figures and Tables

Cite this paper

@inproceedings{Akin2016InmemoryDR,
  title={In-memory Data Reorganization Performed in Parallel with Host Processor Memory Accesses and Provides Mechanisms to Handle Issues Such},
  author={Berkin Akin and Franz Franchetti and James C. Hoe},
  year={2016}
}