Synergistic Processing in Cell's Multicore Architecture

@article{Gschwind2006SynergisticPI,
  title={Synergistic Processing in Cell's Multicore Architecture},
  author={Michael K. Gschwind and H. Peter Hofstee and Brian K. Flachs and Martin E. Hopkins and Yukio Watanabe and Takeshi Yamazaki},
  journal={IEEE Micro},
  year={2006},
  volume={26},
  pages={10--24}
}
Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism. The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading… 

Heterogeneous Multi-core Processors: The Cell Broadband Engine

  • H. P. Hofstee
  • Computer Science
    Multicore Processors and Systems
  • 2009
TLDR
This chapter discusses how memory flow control and the synergistic processor unit architecture extend the Power Architecture™ to allow the creation of heterogeneous implementations that attack the greatest sources of inefficiency in modern microprocessors.

Reconfigurable Functional Units for Scientific Superscalar Processors

TLDR
This paper discusses the design process and evaluates the RFUs' ability to implement instruction dataflow graphs from scientific workloads, and creates several different reconfigurable functional unit (RFU) designs for superscalar multi-processor supercomputers.

A survey of multicore processors

TLDR
Attributes common to all multicore processor implementations are covered, including application domain, power/performance, processing elements, memory system, and accelerators/integrated peripherals.

Integrated execution: A programming model for accelerators

TLDR
A new view of the architectural design choices that were made in consideration of software usability and application development for the Cell/B.E. processor is offered, including the concept of integrated executables that allow a single application to execute across multiple heterogeneous processor elements.

programming for accelerators

TLDR
A new view of the architectural design choices that were made in consideration of software usability and application development for the Cell/B.E. processor is offered and the concept of integrated executables that allow a single application to execute across multiple heterogeneous processor elements is explored.

Chip multiprocessing and the cell broadband engine

TLDR
How the Cell Broadband Engine™ uses parallelism at all levels of the system abstraction to deliver a quantum leap in application performance, and how the Cell Synergistic Memory Flow engine exploits compute-transfer level parallelism by providing efficient block transfer capabilities.

Cell GC: using the cell synergistic processor as a garbage collection coprocessor

TLDR
This work explores the idea of offloading Automatic Dynamic Garbage Collection from the host processor onto accelerator processors using the coprocessor paradigm, implements BDW garbage collection on a Cell system, and offloads the mark phase to the SPE co-processor.

The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor

  • M. Gschwind
  • Computer Science
    International Journal of Parallel Programming
  • 2007
TLDR
The Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism are described; the design takes advantage of opportunities at all levels of the system to deliver previously unattained levels of single-chip performance.

Application profiling on Cell-based clusters

TLDR
This paper examines Cell-centric MPI programs on hybrid clusters containing multiple Opteron and Cell processors per node such as those used in the petascale Roadrunner system and presents a methodology for profiling parallel applications executing on the IBM PowerXCell 8i.

Computer Architecture with Associative Processor Replacing Last-Level Cache and SIMD Accelerator

TLDR
Comparative analysis supported by cycle-accurate simulation and emulation shows that this architecture may outperform a conventional computer architecture comprising a SIMD coprocessor and a shared last-level cache while consuming less power.
...

References

Power efficient processor architecture and the cell processor

  • H. P. Hofstee
  • Computer Science
    11th International Symposium on High-Performance Computer Architecture
  • 2005
TLDR
The paper discusses some of the challenges microprocessor designers face and provides motivation for performance per transistor as a reasonable first-order metric for design efficiency, and alternate architectural choices and some of its limitations are discussed.

Chip multiprocessing and the cell broadband engine

TLDR
How the Cell Broadband Engine™ uses parallelism at all levels of the system abstraction to deliver a quantum leap in application performance, and how the Cell Synergistic Memory Flow engine exploits compute-transfer level parallelism by providing efficient block transfer capabilities.

Power and performance optimization at the system level

The BlueGene/L supercomputer has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers. To achieve

The design and implementation of a first-generation CELL processor

  • D. Pham, S. Asano, ..., K. Yazawa
  • Computer Science
    ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005.
  • 2005
A CELL processor is a multi-core chip consisting of a 64b power architecture processor, multiple streaming processors, a flexible IO interface, and a memory interface controller. This SoC is

Introduction to the Cell multiprocessor

TLDR
This paper discusses the history of the project, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation of the Cell multiprocessor.

AltiVec Extension to PowerPC Accelerates Media Processing

TLDR
PowerPC's AltiVec speeds not only media processing but also nearly any application in which data parallelism exists, as demonstrated by a cycle-accurate simulation of Motorola's MPC 7400, the heart of Apple G4 systems.

IEEE Micro

Optimizing pipelines for power and performance

  • V. Srinivasan, D. Brooks, ..., P. Emma
  • Computer Science
    35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.
  • 2002
TLDR
This paper presents an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor and develops equations that model the variation of energy as a function of pipeline depth.

Exploiting superword level parallelism with multimedia instruction sets

TLDR
This paper develops a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests, and is able to exploit parallelism both across loop iterations and within basic blocks.

Optimizing Compiler for the CELL Processor

TLDR
Several compiler techniques that aim at automatically generating high-quality code across the wide range of heterogeneous parallelism available on the CELL processor are described, and results indicate that significant speedup can be achieved with a high level of support from the compiler.