Corpus ID: 235212485

Compiling Halide Programs to Push-Memory Accelerators

@article{Liu2021CompilingHP,
  title={Compiling Halide Programs to Push-Memory Accelerators},
  author={Qiaoyi Liu and Dillon Huff and Jeff Setter and Maxwell Strange and Kathleen Feng and Kavya Sreedhar and Ziheng Wang and Keyi Zhang and Mark Horowitz and Priyanka Raina and Fredrik Kjolstad},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.12858}
}
Image processing and machine learning applications benefit tremendously from hardware acceleration, but existing compilers target either FPGAs, which sacrifice power and performance for flexible hardware, or ASICs, which rapidly become obsolete as applications change. Programmable domain-specific accelerators have emerged as a promising middle-ground between these two extremes, but such architectures have traditionally been difficult compiler targets. The main obstacle is that these…

References

Showing 1-10 of 41 references
Spatial: a language and compiler for application accelerators
This work describes a new domain-specific language and compiler called Spatial for higher-level descriptions of application accelerators, and summarizes the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning.
Plasticine: A reconfigurable architecture for parallel patterns
This work designs Plasticine, a new spatially reconfigurable architecture for efficiently executing applications composed of parallel patterns, achieving up to a 76.9× improvement in performance-per-watt over a conventional FPGA across a wide range of dense and sparse applications.
Clockwork: Resource-Efficient Static Scheduling for Multi-Rate Image Processing Applications on FPGAs
This work presents Clockwork, a new algorithm for compiling image processing applications to hardware that combines insights from polyhedral analysis and synchronous dataflow to overcome limitations of existing image processing hardware compilers.
Programming Heterogeneous Systems from an Image Processing DSL
The image processing language Halide is extended so users can specify which portions of their applications should become hardware accelerators, and a compiler is provided that uses this code to automatically create the accelerator along with the "glue" code needed for the user's application to access this hardware.
HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing
Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.
HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration
This work proposes HeteroHalide, an end-to-end system for compiling Halide programs to FPGA accelerators that makes use of both the algorithm and the scheduling information specified in a Halide program.
Stream-dataflow acceleration
This work defines stream-dataflow, a general architecture that can more efficiently express programs with broad common properties: the dataflow component enables high concurrency, while the stream component enables communication and coordination at very low power and area overhead.
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
This work presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
Darkroom: compiling high-level image processing code into hardware pipelines
The semantics of the Darkroom language allow it to compile programs directly into line-buffered pipelines, with all intermediate values in local line-buffer storage, eliminating unnecessary communication with off-chip DRAM.
VTA: An Open Hardware-Software Stack for Deep Learning
This work proposes VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads, along with a runtime system equipped with a JIT compiler for flexible code generation and heterogeneous execution that enables effective use of the VTA architecture.