Nathalie Drach-Temam

Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined, though simple, software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial...
In this paper we evaluate the performance of an SMT processor used as the geometry processor for a 3D polygonal rendering engine. To evaluate this approach, we consider PMesa (a parallel version of Mesa) which parallelizes the geometry stage of the 3D pipeline. We show that SMT is suitable for 3D geometry and we characterize the execution of the geometry...
Increasingly complex consumer electronics applications call for embedded processors with higher performance. Multi-cores are capable of delivering the required performance. However, many of these embedded applications must meet some form of soft real-time constraints, and program behavior on multi-cores is even harder to predict than on single-cores. In...
There are two major difficulties in implementing prefetching: avoiding stalling the cache because of prefetch operations, and maintaining coherence between prefetch requests and the cache content. The first constraint is critical because stalling the cache is likely to mean stalling the processor since superscalar processors can issue up to a cache...
This paper presents the performance of DSP, image and 3D applications on recent general-purpose microprocessors using streaming SIMD ISA extensions (integer and floating point). The 9 benchmarks we use for this evaluation have been optimized for DLP and cache use with SIMD extensions and data prefetch. The result of these cumulated optimizations is...
As technology makes it possible to integrate real-time, good-quality 3D rendering in a single chip, the classical problem of the gap between internal data bandwidth and external memories arises. The texture mapping function requires a tremendous number of texture accesses, and many past implementations have been based on costly high-bandwidth external memory. Our...
We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies...
Parallel architectures, databases, networks and distributed systems. About cache associativity in low-cost shared memory multi-microprocessors. Abstract: In 1993, sizes of on-chip caches on current commercial microprocessors range from 16 Kbytes to 36 Kbytes. These microprocessors can be directly used in the design of a low-cost single-bus shared...