An efficient algorithm for exploiting multiple arithmetic units

Robert Marco Tomasulo
This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units. Basic to these techniques is a simple common data busing and register tagging scheme which permits simultaneous execution of independent instructions while preserving the essential precedences inherent in the instruction stream. The common data bus improves performance by efficiently utilizing the execution units without requiring specially… 
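The register tagging and common data bus idea in the abstract can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not the Model 91 hardware: the names `Operand`, `Station`, `Machine`, `issue`, and `complete` are hypothetical, and the model ignores latencies, limited station counts, loads and stores, and bus arbitration. Each register either holds a value or the tag of the station that will produce it; a result broadcast updates every waiting operand and register carrying that tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    value: Optional[float] = None  # known value, if already available
    tag: Optional[str] = None      # producing station's name, if still pending

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class Station:
    name: str
    op: str
    src1: Operand
    src2: Operand


class Machine:
    """Toy model (hypothetical): tagged registers plus a broadcast bus."""

    OPS = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

    def __init__(self, regs):
        self.values = dict(regs)             # register -> last written value
        self.tags = {r: None for r in regs}  # register -> pending producer tag
        self.stations = {}                   # station name -> Station

    def _read(self, reg):
        # Copy the value if it is ready, otherwise copy the producer's tag.
        tag = self.tags[reg]
        return Operand(tag=tag) if tag else Operand(value=self.values[reg])

    def issue(self, name, op, dst, a, b):
        # Sources are captured before dst is re-tagged, so later writers
        # of a source register cannot disturb this instruction.
        self.stations[name] = Station(name, op, self._read(a), self._read(b))
        self.tags[dst] = name

    def ready_names(self):
        return [s.name for s in self.stations.values()
                if s.src1.ready and s.src2.ready]

    def complete(self, name):
        # Execute one ready station and broadcast (tag, result) to all.
        st = self.stations.pop(name)
        assert st.src1.ready and st.src2.ready, "operands still pending"
        result = self.OPS[st.op](st.src1.value, st.src2.value)
        for other in self.stations.values():
            for src in (other.src1, other.src2):
                if src.tag == name:
                    src.value, src.tag = result, None
        for reg, tag in list(self.tags.items()):
            if tag == name:
                self.values[reg], self.tags[reg] = result, None
        return result


m = Machine({"F0": 1.0, "F2": 2.0, "F4": 3.0, "F6": 0.0})
m.issue("MUL1", "mul", "F0", "F2", "F4")  # F0 = F2 * F4
m.issue("ADD1", "add", "F6", "F0", "F2")  # F6 = F0 + F2, waits on MUL1
m.issue("ADD2", "add", "F2", "F4", "F4")  # F2 = F4 + F4, independent
m.complete("ADD2")                        # finishes ahead of earlier work
m.complete("MUL1")                        # broadcast wakes ADD1
m.complete("ADD1")                        # uses the old F2 captured at issue
```

In this sketch the independent `ADD2` completes before the earlier instructions, while `ADD1` still sees the value of `F2` that was current when it was issued, which is the kind of precedence the tagging scheme is meant to preserve.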

An Instruction Fetch Unit for a High-Performance Personal Computer
The instruction fetch unit (IFU) of the Dorado personal computer speeds up the emulation of instructions by prefetching, decoding, and preparing later instructions in parallel with the execution of
Design of Efficient Dynamic Scheduling of RISC Processor Instructions
This modified version of Tomasulo's algorithm is implemented in conjunction with caches for writing results and attaining coherency.
An elementary processor architecture with simultaneous instruction issuing from multiple threads
A multithreaded processor architecture which improves machine throughput and control functional unit conflicts between loop iterations, and a new static code scheduling technique which makes it possible to parallelize loops which are difficult to parallelize in vector or VLIW machines.
OHMEGA : a VLSI superscalar processor architecture for numerical applications
  • M. Nakajima, H. Nakano, H. Kadota
  • Computer Science
    [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture
  • 1991
A VLSI superscalar processor architecture is described which can sustain very high performance in numerical applications and which issues and executes instructions out of order to reduce the pipeline stalls that occur dynamically during execution.
A high performance Prolog processor with multiple function units
This work describes the Parallel Unification Machine (PLUM), a Prolog processor that exploits fine grain parallelism using multiple function units executing in parallel, and shows that PLUM with 3 Unification Units achieves an average speedup of approximately 3.4 over the Berkeley VLSI-PLM.
Instruction Scheduling in Microprocessors
This chapter gives a brief introduction to instruction scheduling on pipelined superscalar architectures, and explains some of the keystone static and dynamic instruction scheduling algorithms.
An efficient pipelined dataflow processor architecture
It is demonstrated that the principles of pipelined instruction execution can be effectively applied in data-flow computers, yielding an architecture that avoids the main sources of pipeline gaps
Instruction translation for an experimental S/390 processor
The hardware mechanisms used for mapping S/390 instructions to internal sequences of RISC instructions are introduced and the facilities, which provide a greater degree of flexibility are discussed.
Efficient Exploitation of Instruction-Level Parallelism for Superscalar Processors by the Conjugate Register File Scheme
The IAS-S compiler consists of two passes, prepass and postpass, and a scheduling-conflict graph is built for the register allocator during the prepass scheduling, which prevents inadequate register allocation.
Application-specific Configuration of Exposed Datapath Architectures
This work presents a novel idea called chaining of functional units, which can be implemented in the SCAD to make it application-specific, and proposes two different methods of chaining, which are evaluated qualitatively to find the most beneficial implementation.


The IBM System/360 model 91: machine philosophy and instruction-handling
It is shown that history recording (the retention of complete instruction loops in the CPU) reduces the need to exercise storage, and that sophisticated employment of buffering techniques has reduced the effective access time.
The IBM system/360 model 91: floating-point execution unit
The principal requirement for the Model 91 floating-point execution unit was that it be designed to support the instruction-issuing rate of the processor, so separate, instruction-oriented algorithms for the add, multiply, and divide functions were developed.
Tomasulo, “The System/360 Model 91: Machine Philosophy and Instruction Handling,”
  • IBM Journal 11,
  • 1967
Received September
  • 1965