Conventional thermal management techniques are reactive in nature; that is, they take action only after the temperature reaches a predetermined threshold. Such approaches do not always minimize and balance the temperature on the chip, and they control temperature at a noticeable performance cost. In this work, we investigate how to use predictors for ...
Designing thermal management strategies that reduce the impact of hot spots and on-die temperature variations at low performance cost is a significant challenge for multiprocessor systems-on-chip (MPSoCs). In this work, we present a proactive MPSoC thermal management approach, which predicts the future temperature and adjusts the job allocation on the ...
In deep submicron circuits, thermal hot spots and high temperature gradients increase cooling costs and degrade reliability and performance. In this paper, we propose a low-cost temperature management strategy for multicore systems to reduce the adverse effects of hot spots and temperature variations. Our technique utilizes online learning to select ...
New proactive fault monitoring innovations are being developed, demonstrated on executing servers, and productized for enhancing the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) has been developed that collects time series signals relating to the health of dynamically executing ...
Thermal hot spots and temperature gradients on the die need to be minimized to manufacture reliable systems while meeting energy and performance constraints. In this work, we solve the task scheduling problem for multiprocessor systems-on-chip (MPSoCs) using Integer Linear Programming (ILP). The goal of our optimization is to minimize the hot spots and ...
This paper presents a real-time machine learning technique that has been adapted from the field of statistical process control (SPC) to give early annunciation of incipient anomalies in signals and processes involving enterprise computing systems and associated networks. A binary-hypothesis technique called the sequential probability ratio test (SPRT) ...
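The abstract above names the SPRT but is cut off before giving any detail. As an illustrative sketch only, and not the paper's implementation, the textbook Wald SPRT for a shift in the mean of a Gaussian signal with known variance might look like the following. The function name, the Gaussian model, and every default parameter value are assumptions chosen for the example.

```python
import math


def sprt_gaussian_mean(samples, mu0=0.0, mu1=1.0, sigma=1.0,
                       alpha=0.01, beta=0.01):
    """Wald's SPRT for H0: mean == mu0 versus H1: mean == mu1.

    Hypothetical helper for illustration. alpha is the target
    false-alarm probability, beta the target missed-alarm probability.
    Returns (decision, number_of_samples_consumed).
    """
    # Wald's approximate decision thresholds on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)   # cross it: accept H1 (anomaly)
    lower = math.log(beta / (1 - alpha))   # cross it: accept H0 (normal)

    llr = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        # Log-likelihood ratio increment for one Gaussian observation.
        llr += (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    # Ran out of data before either threshold was crossed.
    return "undecided", n


if __name__ == "__main__":
    print(sprt_gaussian_mean([1.0] * 20))
    print(sprt_gaussian_mean([0.0] * 20))
```

The test stops as soon as the accumulated evidence crosses either threshold, which is why a sequential test can raise an alarm with fewer samples than a fixed-size test at the same error rates. In a monitoring setting one would typically reset the accumulator after each decision and keep testing; that loop is omitted here.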
Preventing thermal hot spots and large temperature variations on the die is critical for addressing the challenges in system reliability, performance, cooling cost and leakage power. Reactive thermal management methods, which take action after temperature reaches a given threshold, maintain the temperature below a critical level at the cost of performance, ...
Memory leaks are known to be a major cause of reliability and performance issues in software. This paper describes a run-time scheme that detects and removes memory leaks with minimal performance overhead and with no modifications to application source code. The scheme begins with a first stage in which a pattern recognition technique proactively detects ...