• Publications
Memory-centric accelerator design for Convolutional Neural Networks
It is shown that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns of CNN workloads and minimizes on-chip memory size, which in turn reduces area and energy usage.
High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs
This paper presents two novel histogramming methods, both of which achieve higher performance and predictability than existing methods and are guaranteed to be fully data independent.
Multiprocessor systems synthesis for multiple use-cases of multiple applications on FPGA
Techniques are presented to merge multiple use-cases into one hardware design to minimize cost and design time, making the approach well suited to fast design-space exploration (DSE) for MPSoC systems.
Task-level timing models for guaranteed performance in multiprocessor networks-on-chip
This work proposes exact timing models that effectively co-model both the computation and communication of a job, including buffer models, based on interprocessor communication (IPC) graphs.
Dataflow Analysis for Real-Time Embedded Multiprocessor System Design
Dataflow analysis techniques are key to reducing the number of design iterations and shortening the design time of real-time embedded network-based multiprocessor systems that process data streams. With…
Predictable Embedded Multiprocessor System Design
Predictable, heterogeneous, application-domain-specific multiprocessor systems designed around a network-on-chip can meet demanding performance, flexibility, and power-efficiency requirements as well as stringent timing requirements.
Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators
A new analytical methodology is presented to optimize nested loops for inter-tile data reuse using loop transformations such as interchange and tiling. It is demonstrated that small accelerators can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
Skeleton-based automatic parallelization of image processing algorithms for GPUs
This paper presents a technique to automatically parallelize sequential code and map it onto a GPU without the need for code annotations, using domain-specific skeletons and a finer-grained classification of algorithms.
MOVE-Pro: A low power and high code density TTA architecture
In a head-to-head comparison between a two-issue MOVE-Pro processor and its RISC counterpart, it is shown that register file (RF) accesses can be reduced by up to 80%, and that the resulting reduction in RF power translates into a saving in total core power.