Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

Carl-Johannes Johnsen, Tiziano De Matteis, Tal Ben-Nun, Johannes de Fine Licht, Torsten Hoefler
Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed in high levels of abstraction, such as HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis… 

Python FPGA Programming with Data-Centric Multi-Level Design

This work presents the HLS-based FPGA code generation backend of DaCe, and shows how SDFGs are code generated for either FPGA vendor, emitting efficient HLS code that is structured and annotated to implement the desired architecture.



Multi-pumping for resource reduction in FPGA high-level synthesis

This paper proposes a new approach to resource sharing that allows multiple operations to be performed by a single functional unit in one clock cycle, based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2×, allowing multiple computations to complete in a single system cycle.
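
The resource arithmetic behind multi-pumping can be illustrated with a small software model. This is only a sketch under stated assumptions, not the method of any paper listed here: the function names `units_needed` and `multi_pumped_multiply` are hypothetical, clock-domain crossing is ignored, and each functional unit is assumed to complete exactly one operation per fast-clock tick.

```python
from math import ceil


def units_needed(ops_per_system_cycle: int, pump_factor: int) -> int:
    """Functional units required when each unit runs pump_factor times per system cycle.

    Hypothetical helper for illustration; pump_factor=1 models a single-clocked design.
    """
    if ops_per_system_cycle < 0 or pump_factor < 1:
        raise ValueError("need ops_per_system_cycle >= 0 and pump_factor >= 1")
    return ceil(ops_per_system_cycle / pump_factor)


def multi_pumped_multiply(pairs, pump_factor=2):
    """Model one system cycle of multiplications shared across pumped units.

    Returns (results, units_used). Assumes one multiply per fast-clock tick.
    """
    units = units_needed(len(pairs), pump_factor)
    results = []
    for unit in range(units):
        # Operand pairs time-multiplexed onto this unit within one system cycle.
        batch = pairs[unit * pump_factor:(unit + 1) * pump_factor]
        for a, b in batch:  # one iteration per fast-clock tick
            results.append(a * b)
    return results, units


if __name__ == "__main__":
    products, used = multi_pumped_multiply([(1, 2), (3, 4), (5, 6), (7, 8)], pump_factor=2)
    print(products, used)
```

With a pump factor of 2, four multiplications per system cycle need two units in this model rather than four, which is the resource reduction the summary above describes.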

Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures

The Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization, is presented, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.

Transformations of High-Level Synthesis Codes for High-Performance Computing

A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

This work presents a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware.

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

The general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component, is considered, and StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting.

Multipumping Flexible DSP Blocks for Resource Reduction on Xilinx FPGAs

This paper demonstrates multipumping for resource sharing of the flexible DSP48E1 macros in Xilinx FPGAs to enable resource sharing for the full set of supported DSP block operations, and compares this to multipumping only multipliers and DSP blocks with fixed configurations.

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation, is proposed; it improves the average frequency from 147 MHz to 297 MHz with no loss of throughput and a negligible change in resource utilization.

Efficient Implementations of Multi-pumped Multi-port Register Files in FPGAs

A new design is proposed that exploits the banking and replication of BRAMs with an efficient shift-register-based multi-pumping (SR-MPu) approach; the design is independent of the MPu factor and uses up to 47% fewer logic resources than other design methods.

Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems

This work looks at how the number of cache ports affects performance when multiple hardware accelerators operate (and access memory) in parallel, and evaluates two different hardware implementations of multi-ported caches using multi-pumping and a recently-published approach based on the concept of a live-value table.

From software to accelerators with LegUp high-level synthesis

This paper presents an overview of the LegUp design methodology and system architecture, and discusses ongoing work on profiling, hardware/software partitioning, hardware accelerator quality improvements, Pthreads/OpenMP support, visualization tools, and debugging support.