We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a …
Recently, there has been a trend towards clustered microarchitectures to reduce the cycle time for wide-issue microprocessors. In such processors, the register file and functional units are partitioned and grouped into clusters. Instruction scheduling for a clustered machine requires assignment and scheduling of operations to the clusters. In this paper, a …
Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on …
Instruction scheduling is one of the most important phases of compilation for high-performance processors. A compiler typically divides a program into multiple regions of code and then schedules each region. Many past efforts have focused on linear regions such as traces and superblocks. The linearity of these regions can limit speculation, leading to …
Instruction scheduling is a compile-time technique for extracting parallelism from programs for statically scheduled instruction-level parallel processors. Typically, an instruction scheduler partitions a program into regions and then schedules each region. One style of region represents a program as a set of decision trees or treegions. The non-linear …
Many performance/cost advantages can be gained if a chip-set is optimally redesigned to take advantage of the high wire density, fast interconnect delays, and high pin-counts available in MCM-D/flip-chip technology. Examples are given showing the conditions where the cost of the system can be reduced through chip partitioning and how the performance/cost of …
VLIW architectures use very wide instruction words in conjunction with high bandwidth to the instruction cache to achieve multiple instruction issue. This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs. A compressed instruction encoding for VLIWs is defined and a classification scheme for …
Many contemporary multiple issue processors employ out-of-order scheduling hardware in the processor pipeline. Such scheduling hardware can yield good performance without relying on compile-time scheduling. The hardware can also schedule around unexpected run-time occurrences such as cache misses. As issue widths increase, however, the complexity of …
Interrupt handling in out-of-order execution processors requires complex hardware schemes to maintain the sequential state. The amount of hardware will be substantial in VLIW architectures due to the nature of issuing a very large number of instructions in each cycle. It is hard to implement precise interrupts in out-of-order execution machines, especially …