Texture Caches

  • Michael C. Doggett
  • Published 1 May 2012
  • Computer Science
  • IEEE Micro
This column examines the texture cache, an essential component of modern GPUs that plays an important role in achieving real-time performance when generating realistic images. The texture cache is only one of a GPU's many components, but it has a real impact on GPU performance, provided that rasterization and memory tiling are set up correctly.

Architecture Design for a Four-Way Pipelined Parallel Texture Engine

A dedicated hardware architecture for the texture engine of a 3D graphics engine based on OpenGL 3.0 and GLSL 1.40, with a number of novel features such as optimized full-purpose four-way pipelined parallel texel data formatters and filters and a multi-port, multi-bank, non-blocking texture cache.

Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

This paper characterizes the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications and proposes graphics stream-aware probabilistic caching (GSPC), which dynamically learns the reuse probabilities and manages the LLC of the GPU accordingly.

GPUpd: A Fast and Scalable Multi-GPU Architecture Using Cooperative Projection and Distribution

GPUpd, a novel multi-GPU architecture for fast and scalable split frame rendering (SFR), is proposed. It introduces a new graphics pipeline stage called Cooperative Projection & Distribution (C-PD), in which all GPUs cooperatively project 3D objects to the 2D screen and efficiently redistribute the objects to their corresponding GPUs.

Quantifying the NUMA Behavior of Partitioned GPGPU Applications

A framework for analyzing the internal communication behavior of GPGPU applications is introduced. It consists of an open-source memory tracing plugin for Clang/LLVM and a simple communication model, based on summaries of a kernel's memory accesses, that allows reasoning about virtual bandwidth-limited communication paths between NUMA nodes using different partitioning strategies.

Romou: rapidly generate high-performance tensor kernels for mobile GPUs

A mobile-GPU-specific kernel compiler Romou is proposed, which supports the unique hardware feature in kernel implementation, and prunes inefficient ones against hardware resources, and can thus rapidly generate high-performance kernels.

Reviewing GPU architectures to build efficient back projection for parallel geometries

This article builds a performance model to find hardware hotspots and proposes several optimizations to balance the load between the texture engine, computational and special function units, and different types of memory, maximizing the utilization of all GPU subsystems in parallel.

GPU-based implementation of an optimized nonparametric background modeling for real-time moving object detection

This paper presents a novel real-time implementation of an optimized spatio-temporal nonparametric moving object detection strategy that features smart cooperation between a computer/device's Central and Graphics Processing Units and extensive usage of the texture mapping and filtering units of the latter, including a novel method for fast evaluation of Gaussian functions.

AVR: Reducing Memory Traffic with Approximate Value Reconstruction

Approximate Value Reconstruction (AVR) reduces the memory traffic of applications that tolerate approximations in their dataset, significantly improving system performance and energy efficiency, and supports the compression scheme maximizing its effect and minimizing its overheads.

MemSZ: Squeezing Memory Traffic with Lossy Compression

MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in the authors' implementation, and improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.



The Design And Analysis Of A Cache Architecture For Texture Mapping

  • Z. S. Hakura, Anoop Gupta
  • Computer Science
    Conference Proceedings. The 24th Annual International Symposium on Computer Architecture
  • 1997
The use of texture image caches is proposed to alleviate the above bottlenecks, and the results indicate that caching is a promising approach to designing memory systems for texture mapping.

Fermi GF100 GPU Architecture

The Fermi GF100 is a GPU architecture that provides several new capabilities beyond the Nvidia GT200 or Tesla architecture, including tessellation, physics processing, and computational graphics.

Prefetching in a texture cache architecture

This paper introduces a prefetching texture cache architecture designed to take advantage of the access characteristics of texture mapping, and demonstrates that even in the presence of a high-latency memory system, this architecture can attain at least 97% of the performance of a zero-latency memory system.

Rise of the Graphics Processor

  • D. Blythe
  • Computer Science
    Proceedings of the IEEE
  • 2008
This work examines some of this evolution of hardware to accelerate graphics processing operations, looks at the structure of a modern GPU, and discusses how graphics processing exploits this structure and how nongraphical applications can take advantage of this capability.

Larrabee: A Many-Core x86 Architecture for Visual Computing

The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the

Demystifying GPU microarchitecture through microbenchmarking

This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU, exposing undocumented features that impact program performance and correctness.

Reducing shading on GPUs using quad-fragment merging

It is found that a fragment-shading pipeline with this optimization is competitive with the REYES pipeline approach of shading at micropolygon vertices and, in cases of complex occlusion, can perform up to two times less shading work.

Hardware for Superior Texture Performance

This work will focus on the use of a specific compression scheme for texture mapping, which allows the use of very simple and fast decompression hardware, bringing high-performance texture mapping to low-cost systems.

Neon: a single-chip 3D workstation graphics accelerator

High-performance 3D graphics accelerators traditionally require multiple chips on multiple boards, including geometry, rasterizing, pixel processing, and texture mapping chips. These designs are

Larrabee: A many-Core x86 architecture for visual computing

  • D. Carmean
  • Art
    2008 IEEE Hot Chips 20 Symposium (HCS)
  • 2008
This article consists of a collection of slides from the author's conference presentation. Some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.