Accelerating Architectural Simulation Via Statistical Techniques: A Survey
Chip power consumption has reached practical limits, flattening single-core performance. We propose the 10x10 processor, a federated heterogeneous multi-core architecture in which each core is an ensemble of u-engines (micro-engines, similar to accelerators), each specialized for a different workload group, to achieve dramatically higher energy efficiency. Collectively, the u-engines target the entire general-purpose workload space. This article studies how to select the set of workloads for which each u-engine should be customized. We analyze the computation structure of a wide variety of workloads and cluster together those with similar structures, the idea being that each u-engine will be customized for the computation structures exhibited by one cluster. The constraint on this problem is the processor's silicon budget. A lower silicon budget accommodates fewer u-engines, so each u-engine must target a larger segment of the workload space. This lowers the energy-efficiency benefit of customization, because the computation structures within each cluster vary more. We therefore also study how workload coverage and benefit can be maximized for a given silicon budget. We study a broad general-purpose workload comprising 34 codes from 6 benchmark suites, identify the most frequent functions, and cluster them using two sets of instruction-usage features (high-resolution and low-resolution) into 8, 16, 32, 64, and 128 clusters. We develop abstract metrics (coverage and weighted customization benefit) to evaluate the clusters. We show significant potential payoffs under four benefit models: 2-3x (square root model), 4-10x (linear model), 12-24x (quadratic model), and 22-26x (cubic model).
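The clustering-and-coverage idea above can be illustrated with a minimal sketch. It is not the procedure used in this work: the k-means variant with farthest-first initialization, the three-element instruction-mix vectors, the integer execution weights, and the function names `kmeans`, `coverage`, and `weighted_spread` are all assumptions made for illustration. Here "coverage" is taken to mean the fraction of execution weight that falls in the heaviest clusters a given budget of u-engines could serve, and "spread" is a stand-in for the within-cluster variation that reduces the benefit of customization.

```python
import math
from collections import defaultdict


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption): start at points[0], then
    repeatedly add the point farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iters=100):
    """Plain Lloyd's k-means; returns (assignment list, centroids)."""
    centroids = farthest_first_init(points, k)
    assign = None
    for _ in range(max_iters):
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def coverage(assign, weights, budget):
    """Fraction of total execution weight in the `budget` heaviest clusters."""
    per_cluster = defaultdict(int)
    for a, w in zip(assign, weights):
        per_cluster[a] += w
    top = sorted(per_cluster.values(), reverse=True)[:budget]
    return sum(top) / sum(weights)


def weighted_spread(points, assign, centroids, weights):
    """Weight-averaged distance of each function to its cluster centroid."""
    total = sum(w * math.dist(p, centroids[a])
                for p, a, w in zip(points, assign, weights))
    return total / sum(weights)


# Hypothetical data: one row per hot function, columns are the fractions of
# integer, floating-point, and memory instructions; weights are percentages
# of total execution time.
features = [
    [0.8, 0.1, 0.1], [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8], [0.1, 0.2, 0.7],
]
weights = [30, 10, 25, 5, 20, 10]

assign, centroids = kmeans(features, 3)
for budget in (1, 2, 3):
    print(budget, coverage(assign, weights, budget))
print(weighted_spread(features, assign, centroids, weights))
```

Rerunning `kmeans(features, 2)` on the same toy data gives a larger `weighted_spread`, mirroring the trade-off described above: fewer clusters means each one spans more varied computation structures.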