Corpus ID: 236088078

NERO: Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric

@article{Singh2021NEROAW,
  title={NERO: Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric},
  author={Gagandeep Singh and Dionysios Diamantopoulos and Juan G{\'o}mez-Luna and Christoph Hagleitner and Sander Stuijk and Henk Corporaal and Onur Mutlu},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.08716}
}
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory…

References

SHOWING 1-10 OF 149 REFERENCES
NARMADA: Near-Memory Horizontal Diffusion Accelerator for Scalable Stencil Computations
TLDR
This work offloads a horizontal diffusion kernel, which is a compound stencil kernel, from the COSMO weather prediction application to a reconfigurable fabric, and introduces a memory hierarchy tailored to the targeted application and using a coherent memory model, which improves memory efficiency.
Porting the COSMO Weather Model to Manycore CPUs
TLDR
This work demonstrates how an existing domain-specific language that has been designed for CPUs and GPUs can be extended to manycore architectures such as KNL and finds that optimizing code to full performance on modern manycore architectures requires similar effort and hardware knowledge as for GPUs.
ecTALK: Energy efficient coherent transprecision accelerators — The bidirectional long short-term memory neural network case
TLDR
This work proposes an end-to-end architecture that improves energy efficiency by using an FPGA device to accelerate applications and by introducing flexible reduced-precision (transprecision) data paths.
NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning
TLDR
This work presents NAPEL, a high-level performance and energy estimation framework for NMC architectures that leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics and is capable of making accurate predictions for previously unseen applications.
CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators
TLDR
This work proposes CoNDA, a coherence mechanism that lets an NDA optimistically execute an NDA kernel, under the assumption that the NDA has all necessary coherence permissions. This optimistic execution allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system.
Accelerating arithmetic kernels with coherent attached FPGA coprocessors
TLDR
The results show that the coherent attached accelerator outperforms device-driver-based approaches in terms of latency, and that the integration of CAPI into heterogeneous programming frameworks such as OpenCL will facilitate latency-critical operations and further enhance the programmability of hybrid systems.
TOP-PIM: throughput-oriented programmable processing in memory
TLDR
This work explores the use of 3D die stacking to move memory-intensive computations closer to memory and introduces a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware.
Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems
TLDR
This work demonstrates that data buffers in a load-reduced DIMM (LRDIMM), which was originally developed to support large memory systems for servers, are supreme places to integrate near-DRAM accelerators, and proposes Chameleon, an NDA architecture that can be realized without relying on 3D/2.5D-stacking technology.
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
TLDR
This work creates a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without input size restrictions by combining spatial and temporal blocking, and employs multiple FPGA-specific optimizations to tackle issues arising from the added design complexity.
Practical Near-Data Processing for In-Memory Analytics Frameworks
TLDR
This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that it is critical to optimize software frameworks for spatial locality as it leads to 2.9x efficiency improvements for NDP.