Nathalie Drach-Temam

Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined, though simple, software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial(More)
There are two major difficulties in implementing prefetching: avoiding stalling the cache because of prefetch operations, and maintaining coherence between prefetch requests and the cache content. The first constraint is critical because stalling the cache is likely to mean stalling the processor since superscalar processors can issue up to a cache(More)
In this paper we evaluate the performance of an SMT processor used as the geometry processor for a 3D polygonal rendering engine. To evaluate this approach, we consider PMesa (a parallel version of Mesa) which parallelizes the geometry stage of the 3D pipeline. We show that SMT is suitable for 3D geometry and we characterize the execution of the geometry(More)
We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies(More)
Embedded systems based on FPGAs (Field-Programmable Gate Arrays) must deliver higher performance for new applications. However, no high-performance superscalar soft processor is available on FPGAs, because the superscalar architecture is not suitable for FPGAs. High-performance superscalar processors execute instructions out-of-order and it is(More)
Increasingly complex consumer electronics applications call for embedded processors with higher performance. Multi-cores are capable of delivering the required performance. However, many of these embedded applications must meet some form of soft real-time constraints, and program behavior on multi-cores is even harder to predict than on single-cores. In(More)
We present Aftermath, an open source graphical tool designed to assist in the performance debugging process of task-parallel programs by visualizing, filtering and analyzing execution traces interactively. To efficiently exploit increasingly complex and concurrent hardware architectures, both the application and the run-time system that manages task(More)
This paper presents the performance of DSP, image and 3D applications on recent general-purpose microprocessors using streaming SIMD ISA extensions (integer and floating point). The 9 benchmarks we use for this evaluation have been optimized for DLP and cache use with SIMD extensions and data prefetch. The result of these cumulative optimizations is(More)
As technology makes it possible to integrate real-time, good-quality 3D rendering in a single chip, the classical problem of the gap between internal data bandwidth and external memories arises. The texture mapping function requires a tremendous number of texture accesses, and many past implementations have been based on costly high-bandwidth external memory. Our impact(More)