Designing thermal management strategies that reduce the impact of hot spots and on-die temperature variations at low performance cost is a significant challenge for multiprocessor system-on-chips (MPSoCs). In this work, we present a proactive MPSoC thermal management approach, which predicts the future temperature and adjusts the job allocation on the ...
New proactive fault monitoring innovations are being developed, demonstrated on executing servers, and productized for enhancing the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) has been developed that collects time series signals relating to the health of dynamically executing ...
Conventional thermal management techniques are reactive in nature; that is, they take action after temperature reaches a predetermined threshold value. Such approaches do not always minimize and balance the temperature on the chip, and they control temperature at a noticeable performance cost. In this work, we investigate how to use predictors for ...
Thermal hot spots and temperature gradients on the die need to be minimized to manufacture reliable systems while meeting energy and performance constraints. In this work, we solve the task scheduling problem for multiprocessor system-on-chips (MPSoCs) using Integer Linear Programming (ILP). The goal of our optimization is minimizing the hot spots and ...
This paper presents a real-time machine learning technique that has been adapted from the field of statistical process control (SPC) to give early annunciation of incipient anomalies in signals and processes involving enterprise computing systems and associated networks. A binary-hypothesis technique called the sequential probability ratio test (SPRT) ...
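For readers unfamiliar with the SPRT named in the abstract above, the following is a minimal sketch of Wald's classical sequential probability ratio test for a shift in the mean of a Gaussian signal. It is not taken from the paper: the function name `sprt_gaussian`, the Gaussian model with known variance, and the default values of `mu0`, `mu1`, `sigma`, `alpha`, and `beta` are all assumptions chosen for illustration.

```python
import math


def sprt_gaussian(samples, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.01, beta=0.01):
    """Wald's SPRT for H0: mean == mu0 versus H1: mean == mu1.

    Hypothetical illustration only: assumes independent Gaussian samples
    with known standard deviation `sigma`. `alpha` is the target
    false-alarm probability and `beta` the target missed-alarm probability.
    Returns (decision, number_of_samples_consumed).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1.0 - alpha))   # crossing this accepts H0
    llr = 0.0  # cumulative log-likelihood ratio
    n = 0
    for n, x in enumerate(samples, start=1):
        # Per-sample log-likelihood ratio log(f1(x) / f0(x)) for Gaussians.
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma ** 2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n


if __name__ == "__main__":
    # A signal sitting at the hypothesized anomalous mean.
    print(sprt_gaussian([1.0] * 20))
```

The test stops as soon as the cumulative log-likelihood ratio leaves the band between the two thresholds, which is why a sequential test can reach a decision with fewer observations than a fixed-sample test when the evidence is strong.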
In deep submicron circuits, thermal hot spots and high temperature gradients increase cooling costs and degrade reliability and performance. In this paper, we propose a low-cost temperature management strategy for multicore systems to reduce the adverse effects of hot spots and temperature variations. Our technique utilizes online learning to select ...
Preventing thermal hot spots and large temperature variations on the die is critical for addressing the challenges in system reliability, performance, cooling cost and leakage power. Reactive thermal management methods, which take action after temperature reaches a given threshold, maintain the temperature below a critical level at the cost of performance, ...
Application-level software dependability is difficult to ensure. Thus it's typically used only in custom systems and is achieved using one-of-a-kind, handcrafted solutions. We are interested in understanding whether and how these techniques can be applied to more common, lower-end systems. To this end, we have adapted a condition-based maintenance (CBM) ...