Optimal pipelining in supercomputers

@inproceedings{Kunkel1986OptimalPI,
  title={Optimal pipelining in supercomputers},
  author={Steven R. Kunkel and James E. Smith},
  booktitle={ISCA '86},
  year={1986}
}
This paper examines the relationship between the degree of central processor pipelining and performance. This relationship is studied in the context of modern supercomputers. Limitations due to instruction dependencies are studied via simulations of the CRAY-1S. Both scalar and vector code are studied. This study shows that instruction dependencies severely limit performance for scalar code as well as overall performance. The effects of latch overhead are then considered. The primary cause of…
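The trade-off named in the abstract (deeper pipelining shortens the clock cycle, while latch overhead and instruction dependencies limit the gain) can be illustrated with a minimal Python sketch. This is a generic first-order model, not the paper's CRAY-1S simulation methodology: the function names, the parameters `total_logic_delay`, `latch_overhead`, and `hazard_rate`, and the assumption that a dependent instruction stalls for `stages - 1` cycles are all hypothetical choices made for illustration.

```python
def cycle_time(total_logic_delay, latch_overhead, stages):
    """Clock period when the logic is split evenly into `stages` segments.

    Assumption: each stage pays a fixed latch overhead on top of its
    share of the logic delay.
    """
    return total_logic_delay / stages + latch_overhead


def time_per_instruction(total_logic_delay, latch_overhead, hazard_rate, stages):
    """Average time per instruction under a simple dependency model.

    Assumption: a fraction `hazard_rate` of instructions must wait for
    the previous instruction to drain the pipeline (stages - 1 cycles).
    """
    average_cycles = 1 + hazard_rate * (stages - 1)
    return cycle_time(total_logic_delay, latch_overhead, stages) * average_cycles


def best_depth(total_logic_delay, latch_overhead, hazard_rate, max_stages=64):
    """Pipeline depth in 1..max_stages that minimizes time per instruction."""
    return min(
        range(1, max_stages + 1),
        key=lambda n: time_per_instruction(
            total_logic_delay, latch_overhead, hazard_rate, n
        ),
    )


if __name__ == "__main__":
    # Illustrative numbers only: 100 units of logic delay, 1 unit of latch
    # overhead, 10% of instructions stalled by a dependency.
    print(best_depth(100.0, 1.0, 0.1))
```

Under this model, time per instruction has a minimum at a finite depth whenever `hazard_rate` is positive. With no dependencies, a deeper pipeline is always faster, and the gain flattens as the cycle time approaches the latch overhead.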
Complexity-Effective Superscalar Processors
TLDR
A microarchitecture that simplifies wakeup and selection logic is proposed and discussed, which will help minimize performance degradation due to slow bypasses in future wide-issue machines.
The optimum pipeline depth for a microprocessor
TLDR
The impact of pipeline length on the performance of a microprocessor is explored both theoretically and by simulation. Two opposing architectural parameters affect the optimal pipeline length: the degree of instruction-level parallelism (superscalar) decreases it, while the lack of pipeline stalls increases it.
The optimum pipeline depth for a microprocessor
The impact of pipeline length on the performance of a microprocessor is explored both theoretically and by simulation. An analytical theory is presented that shows two opposing architectural…
Synchronous performance and reliability improvement in pipelined ASICs
  • T. Soyata, E. Friedman
  • Computer Science
  • Proceedings Seventh Annual IEEE International ASIC Conference and Exhibit
  • 1994
TLDR
An algorithm is presented for incorporating variable register delays, interconnect delay, and clock skew into retiming. Results of applying the algorithm to MCNC benchmarks are presented, and both performance and reliability improvements are observed.
The performance potential of multiple functional unit processors
TLDR
It is found that in non-vector machines, pipelining multiple function units does not provide significant performance improvements, and it is worthwhile to investigate the performance improvements that can be achieved from issuing multiple instructions each clock cycle.
Synchronization of pipelines
A recently formulated general timing model of synchronous operation is applied to the special case of latch-controlled pipelined circuits. The model accounts for multiphase synchronous clocking,…
Complexity and correctness of a super-pipelined processor
TLDR
This thesis lays the foundation for future work on computing the time per instruction of the DLXπ+ for a given benchmark and different cycle times, in order to determine the “optimum” cycle time of the super-pipelined processor.
The microarchitecture of superscalar processors
TLDR
The general problem solved by superscalar processors, converting an ostensibly sequential program into a more parallel one, is discussed, along with the principles underlying this process and the constraints that must be met.
Complexity-Effective Superscalar Processors
TLDR
This thesis proposes and evaluates two new superscalar microarchitectures designed with the goal of achieving high performance by reducing complexity, and proposes and evaluates the integer-decoupled microarchitecture, which improves the performance of integer programs by minimally adding to a conventional microarchitecture.
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
TLDR
This study indicates that further pipelining can at best improve performance of integer programs by a factor of 2 over current designs, and proposes and evaluates a high-frequency design called a segmented instruction window.

References

SHOWING 1-10 OF 21 REFERENCES
Pipelining of Arithmetic Functions
Two addition and three multiplication algorithms were studied to see the effect of pipelining on system efficiency. A definition of efficiency was derived to compare the relative merits of various…
The IBM System/360 model 91: machine philosophy and instruction-handling
TLDR
It is shown that history recording (the retention of complete instruction loops in the CPU) reduces the need to exercise storage, and that sophisticated employment of buffering techniques has reduced the effective access time.
Very high-speed computing systems
TLDR
The constituents of a system: storage, execution, and instruction handling (branching) are discussed with regard to recent developments and/or systems limitations.
Measuring the Parallelism Available for Very Long Instruction Word Architectures
TLDR
This paper focuses on long instruction word architectures, such as attached scientific processors and horizontally microcoded CPUs, and argues that even with infinite hardware, these architectures could not provide a speedup of more than a factor of 2 or 3 on real programs.
Circuit implementation of high-speed pipeline systems
  • L. Cotten
  • Computer Science
  • AFIPS '65 (Fall, part I)
  • 1965
TLDR
It now appears that 1 to 2 nanosecond hybrid integrated or fully integrated logic circuits, practical fabrication of transmission line interconnections, packaging densities of 5000 logic gates per cubic foot in the machine environment, and 23 to 40 nanosecond integrated scratchpad memories will be made available for systems being constructed over the next one to three year period.
Percolation of Code to Enhance Parallel Dispatching and Execution
This note investigates the increase in parallel execution rate as a function of the size of an instruction dispatch stack with lookahead hardware. Under the constraint that instructions are not…
The Inhibition of Potential Parallelism by Conditional Jumps
TLDR
An infinite machine is postulated, one with an infinite memory and instruction stack, infinite registers, and an infinite number of functional units, to execute a program in parallel at maximum speed by executing each instruction at the earliest possible moment.
Detection and Parallel Execution of Independent Instructions
For a single instruction stream–single data stream organization the problem of simultaneously issuing several instructions is studied.
Applied Micro Circuits Corp
  • Q1500 Series Design Guide
  • 1985
Understanding Supercomputer Benchmarks
  • Datamation
  • 1984