Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators
@article{Jeong2021UnionAU,
  title={Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators},
  author={Geonhwa Jeong and Gokcen Kestor and Prasanth Chatarasi and Angshuman Parashar and Po-An Tsai and Sivasankaran Rajamanickam and Roberto Gioiosa and Tushar Krishna},
  journal={2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
  year={2021},
  pages={30-44}
}
To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms…
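The data-orchestration knobs the abstract refers to, dataflow (loop ordering) and tiling, can be pictured with an ordinary tiled GEMM. The sketch below is purely illustrative and is not part of Union; the tile sizes and the loop order are assumed values standing in for one mapping choice.

```python
# Minimal, illustrative sketch (not the Union toolchain): the same GEMM can be
# orchestrated in many ways by choosing tile sizes and a loop order, which is
# exactly the dataflow/tiling space a spatial accelerator exposes.
import numpy as np

def tiled_gemm(A, B, Tm=4, Tn=4, Tk=4):
    """Tiled C = A @ B; Tm/Tn/Tk are illustrative tile sizes (the 'mapping')."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    # The (m, n, k) tile-loop order is one dataflow choice among many;
    # reordering these loops changes which operand is reused in place.
    for m0 in range(0, M, Tm):
        for n0 in range(0, N, Tn):
            for k0 in range(0, K, Tk):
                C[m0:m0+Tm, n0:n0+Tn] += (
                    A[m0:m0+Tm, k0:k0+Tk] @ B[k0:k0+Tk, n0:n0+Tn]
                )
    return C

A = np.random.rand(16, 16)
B = np.random.rand(16, 16)
assert np.allclose(tiled_gemm(A, B), A @ B)
```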
3 Citations
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators
- Computer Science · ACM Transactions on Architecture and Code Optimization
- 2022
This article characterizes the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules that yield a structured mapping space of the operators, which enables a mapper based on the decoupled off-chip/on-chip approach to accelerate mapping space exploration.
Compiler-Driven Simulation of Reconfigurable Hardware Accelerators
- Computer Science · 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
- 2022
This work designs the Event Queue (EQueue) dialect of MLIR, which can model arbitrary hardware accelerators with explicit data movement and distributed event-based control, and implements a generic simulation engine that models EQueue programs with hybrid MLIR dialects representing different abstraction levels.
EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
- Computer Science · ArXiv
- 2022
EcoFlow enables flexible and high-performance transpose and dilated convolutions on architectures that are otherwise optimized for CNN inference, and evaluates the efficiency of these dataflows on CNN training and Generative Adversarial Network (GAN) training workloads.
References
SHOWING 1-10 OF 41 REFERENCES
Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication
- Computer Science · IEEE Transactions on Parallel and Distributed Systems
- 2022
This work develops a framework that finds optimized mappings (dataflow and tile sizes) of a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy.
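As a rough illustration of what an analytical cost model for mapping search does, the toy sketch below enumerates tile sizes and scores each candidate by estimated off-chip traffic. The buffer capacity, problem size, and cost formula are assumptions for illustration, not the model used in that paper.

```python
# Toy mapping search with an analytical cost model (illustrative only; the
# paper's cost model and mapping space are far richer than this).
from itertools import product

M, N, K = 256, 256, 256          # assumed GEMM problem size
BUF_WORDS = 16 * 1024            # assumed on-chip buffer capacity in words

def fits(tm, tn, tk):
    """A tile of each operand and the output tile must fit on chip."""
    return tm * tk + tk * tn + tm * tn <= BUF_WORDS

def cost(tm, tn, tk):
    """Estimate off-chip traffic (words) for tile sizes (tm, tn, tk)."""
    tiles = (M // tm) * (N // tn) * (K // tk)
    # Each tile iteration fetches an A tile and a B tile; C is written once.
    return tiles * (tm * tk + tk * tn) + M * N

candidates = [t for t in product([16, 32, 64, 128], repeat=3) if fits(*t)]
best = min(candidates, key=lambda t: cost(*t))
print("best tile sizes (Tm, Tn, Tk):", best, "estimated traffic:", cost(*best))
```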
Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators
- Computer Science
- 2020
A decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, first optimizing the off-chip subspace and then the on-chip subspace, to dramatically reduce the size of the search space and to prioritize the optimization of off-chip data movement.
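A minimal sketch of the decoupled idea, using hypothetical subspace and cost-function arguments rather than Marvel's actual interfaces: the off-chip subspace is optimized first, then the on-chip subspace, so the number of evaluations grows with the sum of the subspace sizes instead of their product.

```python
# Illustrative two-phase search; names and cost functions are placeholders.
def decoupled_search(offchip_space, onchip_space, offchip_cost, onchip_cost):
    # Phase 1: pick the off-chip mapping that minimizes (e.g.) DRAM traffic.
    best_off = min(offchip_space, key=offchip_cost)
    # Phase 2: pick the on-chip mapping given the fixed off-chip choice.
    best_on = min(onchip_space, key=lambda m: onchip_cost(best_off, m))
    # |offchip_space| + |onchip_space| evaluations instead of their product.
    return best_off, best_on
```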
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- Computer Science · OSDI
- 2018
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends, and it automates the optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim
- Computer Science · 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
- 2020
This work demonstrates and analyzes the trade-off space for performance, DRAM bandwidth, and energy, identifies sweet spots for various workloads and hardware configurations, and observes that a judicious choice of scaling can lead to performance improvements as high as 50× per layer within the available DRAM bandwidth.
Timeloop: A Systematic Approach to DNN Accelerator Evaluation
- Computer Science · 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
- 2019
Timeloop's underlying models and algorithms are described in detail, and results from case studies enabled by Timeloop are shown, revealing that dataflow and memory hierarchy co-design plays a critical role in optimizing energy efficiency.
GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm
- Computer Science · 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
- 2020
This paper constructs an extremely flexible map space, shows that GAMMA can explore the space and determine an optimized mapping with high sample efficiency, quantitatively compares GAMMA with many popular optimization methods, and observes that GAMMA consistently finds better solutions.
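The sketch below shows the general shape of a genetic-algorithm mapping search: encode tile sizes as genes, then iterate selection, crossover, and mutation. The encoding, fitness function, and hyperparameters are placeholders for illustration; GAMMA's actual operators and cost model differ.

```python
# Minimal genetic-algorithm mapping search (illustrative, not GAMMA itself).
import random

GENE_CHOICES = [1, 2, 4, 8, 16, 32]   # assumed candidate tile sizes per loop
PROBLEM = (64, 64, 64)                # assumed (M, N, K) problem to map

def fitness(genes):
    """Toy cost, lower is better: penalize non-dividing and oversized tiles."""
    return sum(dim % g + g for dim, g in zip(PROBLEM, genes))

def mutate(genes, rate=0.2):
    return [random.choice(GENE_CHOICES) if random.random() < rate else g
            for g in genes]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

population = [[random.choice(GENE_CHOICES) for _ in PROBLEM] for _ in range(20)]
for _ in range(50):                                    # generations
    population.sort(key=fitness)
    parents = population[:10]                          # elitist selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children
print("best mapping genes:", min(population, key=fitness))
```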
cuDNN: Efficient Primitives for Deep Learning
- Computer Science · ArXiv
- 2014
A library similar in intent to BLAS, with optimized routines for deep learning workloads; it contains routines for GPUs and, like the BLAS library, could be implemented for other platforms.
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach
- Computer Science · MICRO
- 2019
This work introduces a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form and codifies this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow, including execution time and energy efficiency, for a given DNN model and hardware configuration.
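A back-of-the-envelope calculation of the algorithmic reuse that such a cost model reasons about, using an assumed GEMM size; this is not the MAESTRO model or its data-centric directive notation.

```python
# Upper bound on reuse for C[M,N] += A[M,K] * B[K,N]: if every element were
# fetched from off-chip exactly once, each word would serve this many MACs.
M, N, K = 128, 128, 128               # assumed problem size
macs = M * N * K                      # total multiply-accumulates
unique_words = M * K + K * N + M * N  # unique operand and result elements
print("MACs:", macs, "unique words:", unique_words,
      "max MACs per word:", macs / unique_words)
```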
dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators
- Computer Science · ACM Trans. Embed. Comput. Syst.
- 2019
dMazeRunner is proposed to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods); the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay Product (EDP) and 5.83× better in execution time compared to prior approaches.
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
- Computer Science · 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2020
SIGMA is proposed: a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity, and includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN).