• Corpus ID: 246210176

Trireme: Exploring Hierarchical Multi-Level Parallelism for Domain Specific Hardware Acceleration

Authors: Georgios Zacharopoulos, Adel Ejjeh, Ying Jing, En-Yu Yang, Tianyu Jia, Iulian Brumar, Jeremy Intan, Muhammad Huzaifa, Sarita V. Adve, Vikram S. Adve, Gu-Yeon Wei, David M. Brooks
Georgios Zacharopoulos, Harvard University, USA; Adel Ejjeh, University of Illinois at Urbana-Champaign, USA; Ying Jing, University of Illinois at Urbana-Champaign, USA; En-Yu Yang, Harvard University, USA; Tianyu Jia, Harvard University, USA; Iulian Brumar, Harvard University, USA; Jeremy Intan, University of Illinois at Urbana-Champaign, USA; Muhammad Huzaifa, University of Illinois at Urbana-Champaign, USA; Sarita Adve, University of Illinois at Urbana-Champaign, USA; Vikram Adve, University of…



Spatial: a language and compiler for application accelerators

This work describes Spatial, a new domain-specific language and compiler for higher-level descriptions of application accelerators, and summarizes the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning.

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDA-to-FPGA Compiler

This work creates an automated flow for efficient platform integration with an existing throughput-oriented CUDA-to-RTL HLS; the FCUDA tool, platform integration, and benchmark applications are open source.

Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures

This work presents Aladdin, a pre-RTL, power-performance accelerator modeling framework, and its application to system-on-chip (SoC) simulation, giving researchers an approach to model the power and performance of accelerators in an SoC environment.

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.

Peruse and Profit: Estimating the Accelerability of Loops

This paper presents Peruse, a tool that characterizes the features of loops in an application and helps the programmer understand how amenable those loops are to acceleration, and develops a machine-learning-based model to predict the speedup of loops selected by Peruse.

Co-designing accelerators and SoC interfaces using gem5-Aladdin

This work shows that the optimal energy-delay product of an accelerator microarchitecture can improve by up to 7.4× when system-level effects are considered, compared to optimizing accelerators in isolation.

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

This work adapts the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric, and is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

MachSuite: Benchmarks for accelerator design and customized architectures

This work presents MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures, which spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers.

TAPAS: Generating Parallel Accelerators from Parallel Programs

TAPAS is a complete, open-source HLS toolchain for synthesizing parallel programs to accelerators, and it is demonstrated that TAPAS can generate accelerators for concurrent programs with heterogeneous, nested, and recursive parallelism.

RegionSeeker: Automatically Identifying and Selecting Accelerators From Application Source Code

This work provides a method to identify subgraphs of control flow graphs that have a single input control point and a single output control point and are good targets for synthesis as application-specific hardware accelerators, along with an LLVM-based toolchain that analyzes a software application and automatically selects its most profitable regions under an area constraint.