Intelligently partitioning the last-level cache within a chip multiprocessor can bring significant performance improvements. Resources are given to the applications that can benefit most from them, restricting each core to a number of logical cache ways. However, although overall performance is increased, existing schemes fail to consider energy saving when…
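A minimal sketch of the utility-based way allocation this abstract alludes to, assuming per-application miss curves are available; the miss numbers, NAPPS, NWAYS and the greedy marginal-gain rule are illustrative and are not the paper's (energy-aware) scheme.

/* Hedged sketch: hand out cache ways one at a time to the application whose
 * miss count drops the most, i.e. the one that benefits most from the extra
 * way.  The miss-curve values below are made up for illustration. */
#include <stdio.h>

#define NAPPS 2
#define NWAYS 8

/* misses[a][w]: misses of application a when given w ways (w = 0..NWAYS). */
static const int misses[NAPPS][NWAYS + 1] = {
    {100, 60, 40, 30, 25, 22, 20, 19, 18},   /* cache-sensitive application */
    {100, 95, 92, 90, 89, 88, 88, 88, 88},   /* streaming application       */
};

int main(void)
{
    int ways[NAPPS] = {0};

    for (int given = 0; given < NWAYS; ++given) {
        int best = 0, best_gain = -1;
        for (int a = 0; a < NAPPS; ++a) {
            if (ways[a] == NWAYS)
                continue;
            int gain = misses[a][ways[a]] - misses[a][ways[a] + 1];
            if (gain > best_gain) { best_gain = gain; best = a; }
        }
        ways[best]++;
    }

    for (int a = 0; a < NAPPS; ++a)
        printf("app %d: %d ways\n", a, ways[a]);
    return 0;
}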
Compiler-based error detection methodologies replicate the instructions of the program and insert checks wherever they are needed. The checks evaluate code correctness and decide whether or not an error has occurred. The replicated instructions and the checks cause a large slowdown. In this work, we focus on reducing the error detection overhead and improving…
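A minimal sketch of the replicate-and-check idea, assuming a simple arithmetic sequence is being protected; the names sum_dup and detected_error, and the placement of the check just before the value is used, are illustrative rather than the paper's actual compiler pass.

/* Hedged sketch: how an error-detection pass might transform a computation. */
#include <stdio.h>
#include <stdlib.h>

static void detected_error(void)
{
    /* In a real scheme this would trigger recovery or a safe shutdown. */
    fprintf(stderr, "soft error detected\n");
    exit(EXIT_FAILURE);
}

int main(void)
{
    int a = 3, b = 4;

    /* Original instruction stream. */
    int sum = a * a + b * b;

    /* Replicated instruction stream operating on separate values. */
    int a_dup = a, b_dup = b;
    int sum_dup = a_dup * a_dup + b_dup * b_dup;

    /* Check inserted before the value leaves the replicated region
     * (e.g. before a store or an output). */
    if (sum != sum_dup)
        detected_error();

    printf("%d\n", sum);
    return 0;
}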
SIMD vectors are widely adopted in modern general-purpose processors as they can boost performance and energy efficiency for certain applications. Compiler-based automatic vectorization is one approach for generating code that makes efficient use of the SIMD units, and has the benefit of avoiding hand development and platform-specific optimizations. The…
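A minimal sketch of the kind of loop an auto-vectorizer targets; the saxpy name and the restrict qualifiers are illustrative, and the compiler flag mentioned in the comment is a standard GCC option rather than anything specific to the paper.

/* Hedged sketch: a loop amenable to compiler auto-vectorization.  With e.g.
 * `gcc -O3` (which enables -ftree-vectorize) the compiler can map this loop
 * onto the SIMD units, processing several elements per instruction, provided
 * it can prove x and y do not alias (the restrict qualifiers help). */
#include <stddef.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}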
Topic: Portable compiler optimizations for parallel computing. My research focuses on the evaluation and tuning of compiler transformations for graphics processors. The final goal is to develop a single optimizing compiler capable of automatically achieving performance portability across devices of different generations and vendors. My main research tools…
Designing high-performance software queues for fast intercore communication is challenging, but critical for maximising software parallelism. State-of-the-art single-producer/single-consumer queues for streaming applications contain multiple sections, requiring the producer and consumer to operate on different sections, independently of each other. While…
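A minimal sketch of the single-section baseline (a Lamport-style ring buffer) that such sectioned designs improve on; CAP, the element type, zero-initialisation of the struct, and the memory orderings are assumptions, and the multi-section layout itself is not shown.

/* Hedged sketch: single-producer/single-consumer ring buffer.  Sectioned
 * designs let the producer and consumer work in different regions of the
 * buffer so they touch the shared control variables less often. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 1024               /* power of two, so masking replaces modulo */

struct spsc_queue {
    int buf[CAP];
    _Atomic size_t head;       /* advanced by the consumer */
    _Atomic size_t tail;       /* advanced by the producer */
};

/* Called by the producer thread only. */
bool spsc_push(struct spsc_queue *q, int v)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == CAP)
        return false;          /* full */
    q->buf[t & (CAP - 1)] = v;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

/* Called by the consumer thread only. */
bool spsc_pop(struct spsc_queue *q, int *out)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h)
        return false;          /* empty */
    *out = q->buf[h & (CAP - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}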
Clustered VLIW architectures are statically scheduled wide-issue architectures that combine the advantages of wide-issue processors with the power and frequency scalability of clustered designs. Being statically scheduled, they require that the mapping of instructions to clusters be decided by the compiler. State-of-the-art code generation for…
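A minimal sketch of a greedy instruction-to-cluster assignment, under the assumption of two-operand instructions visited in topological order; NCLUSTERS and the tie-breaking rule are illustrative and not the code-generation algorithm the abstract refers to.

/* Hedged sketch: assign each instruction to the cluster that already holds
 * more of its operands, breaking ties towards the less loaded cluster,
 * trading inter-cluster copies against load balance. */
#define NCLUSTERS 2

struct insn {
    int src0, src1;            /* indices of producer instructions, -1 if none */
};

void assign_clusters(const struct insn *prog, int n, int *cluster_of)
{
    int load[NCLUSTERS] = {0};

    for (int i = 0; i < n; ++i) {
        int votes[NCLUSTERS] = {0};
        if (prog[i].src0 >= 0) votes[cluster_of[prog[i].src0]]++;
        if (prog[i].src1 >= 0) votes[cluster_of[prog[i].src1]]++;

        int best = 0;
        for (int c = 1; c < NCLUSTERS; ++c)
            if (votes[c] > votes[best] ||
                (votes[c] == votes[best] && load[c] < load[best]))
                best = c;

        cluster_of[i] = best;
        load[best]++;
    }
}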
We present a decoupled architecture of processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. The decoupled architecture also exploits the parallelism between address computation and processing of the application data. The application code is split into two programs: the first computes the addresses of the data in the memory…
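A minimal sketch of the access/execute split described here, shown sequentially for clarity; in the proposed architecture the two halves run as separate programs communicating through a queue, whereas below a plain array stands in for that queue and the indirection pattern is illustrative.

/* Hedged sketch: one loop split into an address-computation slice and a
 * data-processing slice. */
#include <stdio.h>

#define N 8

/* "Access" program: computes the addresses (here, indices) of the data. */
static void access_slice(const int *idx, int n, int *addr_queue)
{
    for (int i = 0; i < n; ++i)
        addr_queue[i] = idx[i];          /* address computation only */
}

/* "Execute" program: consumes the addresses and processes the data. */
static int execute_slice(const int *data, const int *addr_queue, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += data[addr_queue[i]];      /* data processing only */
    return sum;
}

int main(void)
{
    int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int idx[N]  = {7, 6, 5, 4, 3, 2, 1, 0};
    int addr_queue[N];

    access_slice(idx, N, addr_queue);
    printf("%d\n", execute_slice(data, addr_queue, N));
    return 0;
}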
This paper presents a decoupled architecture of processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. The decoupled architecture also exploits the parallelism between address computation and processing of the application data. The application code is split into two programs: the first computes the addresses of the data in the…