Matrix scheduler reloaded
@inproceedings{Sassone2007MatrixSR,
title={Matrix scheduler reloaded},
author={Peter G. Sassone and Jeff Rupley and Edward Brekelbaum and Gabriel H. Loh and Bryan Black},
booktitle={International Symposium on Computer Architecture},
year={2007}
}
From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is quite evident in schedulers, which need to be large and single-cycle for maximum performance on out-of-order cores. In this work we present two straightforward modifications to a matrix scheduler implementation which greatly strengthen its scalability. Both are based on the simple…
43 Citations
Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue
- Computer Science
- 2008
This paper proposes a way to repurpose a pair of scalar cores into a 2-way out-of-order issue core with minimal area overhead; the result achieves performance comparable to a dedicated out-of-order core while dissipating less power.
A physical-level study of the compacted matrix instruction scheduler for dynamically-scheduled superscalar processors
- Computer Science
- 2009 International Symposium on Systems, Architectures, Modeling, and Simulation
- 2009
This work investigates the latency and energy variations of the compacted matrix and its accompanying logic as a function of the issue width, the window size, and the number of global recovery checkpoints and proposes an energy optimization that throttles unnecessary pre-charges and evaluations.
Forwardflow: Scalable, RAM-Based Dataflow Execution
- Computer Science
- 2008
This work presents the Forwardflow microarchitecture, which executes instructions out-of-order using RAM-based structures in lieu of non-scalable CAM- or matrix-based mechanisms, and dynamically builds an explicit internal dataflow representation from a conventional ISA.
Efficient throughput cores for asymmetric manycore processors
- Computer Science
- 2009
This work shows how the single-thread performance of small, scalar cores can be increased or dynamically combined to speed up programs with only a limited number of parallel threads.
Forwardflow: a scalable core for power-constrained CMPs
- Computer Science
- ISCA
- 2010
This work presents the Forwardflow Architecture, which can scale its execution logic up to run single threads, or down to run multiple threads in a CMP, and allows system software to select the performance point that best matches available power.
Federation: Repurposing scalar cores for out-of-order instruction issue
- Computer Science
- 2008 45th ACM/IEEE Design Automation Conference
- 2008
A way to repurpose a pair of scalar cores into a 2-way out-of-order issue core with minimal area overhead, achieving performance comparable to a dedicated out-of-order core while dissipating less power.
Non-speculative enhancements for the scheduling logic
- Business
- 2010
This thesis proposes the Dependence Level Scheduler (DLS), which tolerates the latency of the wakeup-select hardware loop and aims to obtain the ideal performance of a sequential model with the costs of a model sliced into arbitration domains.
Federation: Out-of-Order Execution using Simple In-Order Cores
- Computer Science
- 2007
Federating each pair of neighboring, scalar cores provides a scalable, energy-efficient, and area-efficient solution for limited thread counts, with the ability to boost performance across a wide range of thread counts, until thread count returns to a level at which the baseline, multithreaded, “throughput mode” can resume.
Reconstructing Out-of-Order Issue Queue
- Computer Science
- 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
- 2022
Ballerino is a novel microarchitecture that performs balanced and cache-miss-tolerable dynamic scheduling via a complementary combination of cascaded and clustered in-order IQs, and achieves comparable performance to an 8-wide out-of-order core by using twelve in-orders, improving core-wide energy efficiency by 20%.
SQUIP: Exploiting the Scheduler Queue Contention Side Channel
- Computer Science
- 2022
This work presents the SQUIP attack, the first side-channel attack on scheduler queues, which are critical for deciding the schedule of instructions to be executed in superscalar CPUs. It reverse-engineers the behavior of the scheduler queues on these CPUs and shows that they can be primed and probed.
References
Continual flow pipelines
- Computer Science
- ASPLOS XI
- 2004
Continual Flow Pipelines (CFP) is presented as a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large.
Efficient dynamic scheduling through tag elimination
- Computer Science
- Proceedings 29th Annual International Symposium on Computer Architecture
- 2002
Simulation-based analyses find that most instructions enter the window with at least one of their input operands already available, and introduce a last-tag speculation mechanism that eliminates all remaining tag comparators except those for the last arriving input operand, allowing dynamic schedulers with approximately one quarter of the tag comparators found in conventional designs.
Cyclone: a broadcast-free dynamic instruction scheduler with selective replay
- Computer Science
- 30th Annual International Symposium on Computer Architecture, 2003. Proceedings.
- 2003
The Cyclone scheduler is presented, a novel design that captures the benefits of both compile- and run-time scheduling that can rival the instruction throughput of similarly wide monolithic dynamic schedulers.
Hierarchical scheduling windows
- Computer Science
- 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.
- 2002
Hierarchical Scheduling Windows is introduced, which exploits latency tolerant instructions in order to reduce implementation complexity and yields a very large instruction window that tolerates wakeup, select, and bypass latency, while extracting significant far flung ILP.
On-chip interconnects and instruction steering schemes for clustered microarchitectures
- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2005
This work investigates the design of on-chip interconnection networks for clustered superscalar microarchitectures, and proposes some point-to-point cluster interconnects and new improved instruction steering schemes that achieve much better performance than bus-based ones.
A high-speed dynamic instruction scheduling scheme for superscalar processors
- Computer Science
- Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34
- 2001
A new scheduling scheme based not on association but on matrices that represent the dependences between instructions, which achieves a 2.7 GHz clock speed with an IPC degradation of about 1%.
A large, fast instruction window for tolerating cache misses
- Computer Science
- Proceedings 29th Annual International Symposium on Computer Architecture
- 2002
Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
Slack: maximizing performance under technological constraints
- Computer Science
- Proceedings 29th Annual International Symposium on Computer Architecture
- 2002
This work develops slack for use in creating control policies that match program execution behavior to machine design, and illustrates how to create a control policy based on slack for steering instructions among fast and slow power pipelines.
Focusing processor policies via critical-path prediction
- Computer Science
- Proceedings 28th Annual International Symposium on Computer Architecture
- 2001
This paper introduces a hardware predictor of instruction criticality and uses it to improve performance, using a dependence-graph model of the microarchitectural critical path that identifies execution bottlenecks by incorporating both data and machine-specific dependences.
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
- Computer Science
- 37th International Symposium on Microarchitecture (MICRO-37'04)
- 2004
It is shown that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor's stages and the capacities of many of its structures without custom latency-reduction hardware.




















