In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should be performed locally on embedded computer platforms subject to performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNNs) can achieve these targets with ...
Nearly all platforms use a multi-layer memory hierarchy to bridge the enormous latency gap between the large off-chip memories and local register files. However, most previous work on HW- or SW-controlled techniques for layer assignment has mainly focussed on performance. As a result, the intermediate layers have been assigned too large sizes ...
We present a semi-automated method for the detection and exploitation of application-domain-specific instruction set extensions for embedded (VLIW) processors. It consists of three steps: the first step detects frequently occurring operation patterns; in the second step, the patterns are grouped and implemented in a number of Special Function Units (SFUs) ...
Embedded multimedia systems often run multiple time-constrained applications simultaneously. These systems use multiprocessor systems-on-chip, for which it must be guaranteed that enough resources are available for each application to meet its throughput constraints. This requires a task binding and scheduling mechanism that provides timing guarantees for ...
This chapter presents a retargetable code generator that specializes in compiling self-test programs and exploits new techniques from Constraint Logic Programming (CLP). Firstly, we show how CLP can be exploited to improve the software production process, especially for retargetable code generation and test generation. CLP combines the declarative ...
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is ...
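The reuse distance (stack distance) mentioned in the abstract above can be illustrated for a sequential access trace. The sketch below is not the paper's GPU model; it uses the textbook definition (the number of distinct addresses touched since the previous access to the same address), and the function name `reuse_distances` and the sample trace are hypothetical.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access in a sequential trace.

    The distance is the number of distinct addresses touched since the
    previous access to the same address; a first access yields None.
    """
    stack = []       # LRU stack: most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Every entry above `pos` was touched since the last use of addr.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)   # cold (first) access
        stack.append(addr)           # addr becomes the most recently used
    return distances


# Hypothetical trace of memory addresses, named for readability.
print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
```

Under this definition, an access hits in a fully associative LRU cache of C lines exactly when its distance is smaller than C, which is why such distances are useful for predicting cache behaviour.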
Graphics Processing Units (GPUs) are suitable for highly data-parallel algorithms such as image processing, due to their massively parallel processing power. Many image processing applications use the histogramming algorithm, which fills a set of bins according to the frequency of occurrence of pixel values taken from an input image. Histogramming has been ...
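The histogramming step described above can be sketched sequentially as follows. This is a plain reference version, not the GPU implementation the abstract refers to; the function name `histogram`, the `num_bins` parameter, and the sample pixel values are assumptions made for illustration.

```python
def histogram(pixels, num_bins=256):
    """Count how often each pixel value occurs.

    Assumes every pixel value is an integer in the range 0..num_bins-1.
    """
    bins = [0] * num_bins
    for value in pixels:
        bins[value] += 1   # one increment per pixel, indexed by its value
    return bins


# Hypothetical flattened image with four possible pixel values.
image = [0, 3, 3, 1, 0, 3]
print(histogram(image, num_bins=4))
```

In a parallel version, many threads would perform the increment on the same bin concurrently, so the update would need to be atomic or be accumulated in per-thread partial histograms.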