Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

@inproceedings{jeong2021union,
  title={Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators},
  author={Geonhwa Jeong and Gokcen Kestor and Prasanth Chatarasi and Angshuman Parashar and Po-An Tsai and Sivasankaran Rajamanickam and Roberto Gioiosa and Tushar Krishna},
  booktitle={2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
  year={2021}
}
  • Published 1 September 2021
To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms… 
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators
This article characterizes the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules that yield a structured mapping space, which enables a mapper based on a decoupled off-chip/on-chip approach to accelerate mapping-space exploration.
Compiler-Driven Simulation of Reconfigurable Hardware Accelerators
This work designs the Event Queue (EQueue) dialect of MLIR, which can model arbitrary hardware accelerators with explicit data movement and distributed event-based control, and implements a generic simulation engine that models EQueue programs using hybrid MLIR dialects representing different abstraction levels.
EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
EcoFlow enables flexible and high-performance transpose and dilated convolutions on architectures that are otherwise optimized for CNN inference, and evaluates the efficiency of the dataflows on CNN training workloads and Generative Adversarial Network (GAN) training workloads.


Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication
This work develops a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM on a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy.
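A minimal sketch of the kind of search such a framework performs: enumerate tile sizes for a GEMM under an on-chip buffer budget and pick the tiling that minimizes an analytical cost (here, a classic DRAM-traffic model). The function names and the cost model are illustrative assumptions, not the paper's actual framework.

```python
# Hypothetical tile-size search for C[M,N] = A[M,K] @ B[K,N].
# Assumption: the analytical cost is off-chip (DRAM) traffic in elements;
# in this simplified model, K-tiling affects only the on-chip footprint.
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def dram_traffic(M, N, K, tm, tn):
    # Each operand tile is re-streamed once per iteration of the loop
    # it does not depend on; C is read and written once.
    return M * K * (N // tn) + K * N * (M // tm) + 2 * M * N

def best_tiling(M, N, K, buffer_elems):
    best = None
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        # On-chip footprint: one tile each of A, B, and C must fit.
        if tm * tk + tk * tn + tm * tn > buffer_elems:
            continue
        cost = dram_traffic(M, N, K, tm, tn)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best

print(best_tiling(64, 64, 64, 2048))  # → (24576, (32, 32, 1))
```

Even this toy search shows why exhaustive enumeration is only viable for small spaces: the divisor grid grows multiplicatively with each tiled loop, which is what motivates the pruned and decoupled searches in the papers above.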
Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators
A decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces and optimizes the off-chip subspace first, followed by the on-chip subspace, dramatically reducing the size of the search space while prioritizing the optimization of off-chip data movement.
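A toy illustration of the decoupled idea: first fix off-chip tile sizes by minimizing DRAM traffic alone, then, with the tiles frozen, pick an on-chip loop order by a separate on-chip cost. Both cost functions and all names here are invented for the sketch, not Marvel's actual formulation.

```python
# Two-phase (decoupled) mapping search for C[M,N] = A[M,K] @ B[K,N].
from itertools import permutations

def offchip_cost(M, N, K, tm, tn):
    # DRAM traffic in elements for a tiled GEMM (assumed model).
    return M * K * (N // tn) + K * N * (M // tm) + 2 * M * N

def onchip_cost(order, tm, tn, tk):
    # Invented proxy: make the innermost loop the largest tile
    # dimension to maximize operand reuse close to the PEs.
    sizes = {"m": tm, "n": tn, "k": tk}
    return -sizes[order[-1]]

def decoupled_search(M, N, K, tile_candidates):
    # Phase 1: off-chip subspace only (tile sizes).
    tm, tn, tk = min(tile_candidates,
                     key=lambda t: offchip_cost(M, N, K, t[0], t[1]))
    # Phase 2: on-chip subspace only (loop order), tiles held fixed.
    order = min(permutations("mnk"),
                key=lambda o: onchip_cost(o, tm, tn, tk))
    return (tm, tn, tk), "".join(order)

tiles, loop_order = decoupled_search(
    64, 64, 64, [(16, 16, 16), (32, 32, 8), (8, 64, 8)])
```

Because each phase searches only its own subspace, the cost is the sum of the two subspace sizes rather than their product, which is the source of the dramatic search-space reduction the summary describes.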
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends, and automates the optimization of low-level programs to hardware characteristics by employing a novel learning-based cost-modeling method for rapid exploration of code optimizations.
A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim
This work demonstrates and analyzes the trade-off space for performance, DRAM bandwidth, and energy; identifies sweet spots for various workloads and hardware configurations; and observes that a judicious choice of scaling can lead to per-layer performance improvements as high as 50×, within the available DRAM bandwidth.
Timeloop: A Systematic Approach to DNN Accelerator Evaluation
Timeloop's underlying models and algorithms are described in detail and results from case studies enabled by Timeloop are shown, which reveal that dataflow and memory hierarchy co-design plays a critical role in optimizing energy efficiency.
GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm
This paper constructs an extremely flexible map-space and shows that GAMMA can explore the space and determine an optimized mapping with high sample efficiency; it quantitatively compares GAMMA with many popular optimization methods and observes that GAMMA consistently finds better solutions.
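A toy genetic algorithm over a tiny mapping space, in the spirit of what the summary describes: genomes encode tile choices, and selection, crossover, and mutation evolve the population toward a better mapping. The genome encoding, fitness function, and all names are invented for this sketch; GAMMA's actual operators and map-space are far richer.

```python
# Minimal GA over (tm, tn, tk) tile choices; fitness is an invented
# stand-in cost that prefers balanced tiles with product near 512.
import random

random.seed(0)
TILE_CHOICES = [1, 2, 4, 8, 16, 32]

def fitness(genome):
    tm, tn, tk = genome
    return -abs(tm * tn * tk - 512) - abs(tm - tn)

def mutate(genome):
    g = list(genome)
    g[random.randrange(3)] = random.choice(TILE_CHOICES)
    return tuple(g)

def crossover(a, b):
    cut = random.randrange(1, 3)
    return a[:cut] + b[cut:]

def search(generations=30, pop_size=20):
    pop = [tuple(random.choice(TILE_CHOICES) for _ in range(3))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # elitism: keep top half
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best = search()
```

The sample efficiency claimed in the paper comes from the fact that each generation only evaluates `pop_size` candidate mappings against the cost model, rather than sweeping the full map-space.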
cuDNN: Efficient Primitives for Deep Learning
A library similar in intent to BLAS, with optimized routines for deep learning workloads, that contains routines for GPUs, and similarly to the BLAS library, could be implemented for other platforms.
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach
This work introduces a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form and codifies this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow, including execution time and energy efficiency, for a DNN model and hardware configuration.
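A hedged sketch (not MAESTRO's actual model) of what such an analytical cost model computes: energy from weighted memory-access counts, delay from compute-bound cycles, and their combination into an Energy-Delay Product (EDP), the metric several of the papers above optimize. All per-access energies and the access-count model are invented constants.

```python
# Toy energy/delay estimate for C[M,N] = A[M,K] @ B[K,N] on a PE array.
# Assumptions: compute-bound latency, two SRAM operand reads per MAC,
# each tensor touched in DRAM exactly once (a best-case reuse bound).
def gemm_costs(M, N, K, pes=256, freq_hz=1e9,
               mac_pj=1.0, sram_pj=6.0, dram_pj=200.0):
    macs = M * N * K
    delay_s = macs / pes / freq_hz             # ideal compute-bound latency
    sram_acc = 2 * macs                        # A and B reads per MAC
    dram_acc = M * K + K * N + 2 * M * N       # each tensor touched once
    energy_j = (macs * mac_pj + sram_acc * sram_pj
                + dram_acc * dram_pj) * 1e-12  # picojoules -> joules
    return energy_j, delay_s, energy_j * delay_s

e, d, edp = gemm_costs(64, 64, 64)
```

Changing the assumed access counts to reflect a particular dataflow's reuse is exactly where such a model's value lies: the same arithmetic work can differ widely in energy depending on how often each operand leaves the cheap, close memories.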
dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators
dMazeRunner is proposed to efficiently and accurately explore the vast space of different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods); the solutions discovered by dMazeRunner are shown to be on average 9.16× better in Energy-Delay Product (EDP) and 5.83× better in execution time than prior approaches.
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
  • Eric Qin, A. Samajdar, T. Krishna
  • 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2020
SIGMA is proposed, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity, and includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN).