DiG: enabling out-of-band scalable high-resolution monitoring for data-center analytics, automation and control (extended)

Antonio Libri, Andrea Bartolini, L. Benini
arXiv: Distributed, Parallel, and Cluster Computing
Data centers are increasing in size and complexity, and we need scalable approaches to support their automated analysis and control. Performance counters and power consumption are their key "vital signs". State-of-the-Art (SoA) monitoring systems provide built-in tools to collect performance measurements, and custom solutions to get insight on their power consumption. However, with the increase in measurement resolution (in time and space) and the ensuing huge amount of measurement data to…
pAElla: Edge AI-Based Real-Time Malware Detection in Data Centers
A novel lightweight and scalable approach to increasing the security of DCs/SCs, which applies AI-powered edge computing to high-resolution power consumption measurements for real-time malware detection and significantly outperforms SoA approaches in accuracy.
Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics
This paper describes a highly scalable telemetry architecture that supports event aggregation, the application of RAS policies, and feedback to the cluster control system, and that is in use in both cloud and HPC environments.
Countdown Slack: A Run-Time Library to Reduce Energy Footprint in Large-Scale MPI Applications
A new approach based on the separation of communication phases and slack during MPI calls and a timeout algorithm to cope with the hardware power management latency is proposed, which jointly makes it possible to achieve performance-neutral power saving in MPI applications without requiring labor-intensive and risky application source code modifications.
Fine-grained application tuning on OpenPOWER HPC systems
This paper evaluates dynamic application tuning as a way to reduce the energy consumption of HPC systems on IBM's Power8+ system, tuning two available hardware parameters: Dynamic Voltage and Frequency Scaling, and concurrency throttling with a focus on hyper-threading, which plays a significant role in performance on Power8+.
Application instrumentation for performance analysis and tuning with focus on energy efficiency
This work presents a shared C/C++ API, inserted on request, for the most common open-source HPC performance analysis tools, which simplifies the process of manual instrumentation.
Online Anomaly Detection in HPC Systems
This work uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system; it is deployed on the edge of each computing node and obtains very good accuracy.
Visualization and Machine Learning for Data Center Management
A novel tool for data center management that incorporates data visualization and machine learning capabilities is presented in the context of an action design research project conducted at a large government agency in Germany, which hosts three highly available data centers containing more than 10,000 servers.


Evaluation of synchronization protocols for fine-grain HPC sensor data time-stamping and collection
Evaluates how the performance of two widely used network synchronization protocols, the Network Time Protocol and IEEE 1588, scales on a state-of-the-art embedded platform, the Beaglebone Black board.
Power measurement techniques for energy-efficient computing: reconciling scalability, resolution, and accuracy
A scalable measurement solution for hundreds of nodes at millisecond granularity that is tightly integrated into the HPC system, and a sophisticated single-node instrumentation to measure the power consumption of application events in the microsecond range are discussed.
Evaluation of NTP/PTP fine-grain synchronization performance in HPC clusters
The results show that NTP can reach an accuracy of 2.6 μs and a precision below 2.7 μs on computing nodes, with negligible overhead, and that PTP with low-cost switches (no GPS antenna needed) can bound these figures below microseconds.
HDEEM: High Definition Energy Efficiency Monitoring
The High Definition Energy Efficiency Monitoring (HDEEM) infrastructure is introduced, a sophisticated approach towards systemwide and fine-grained power measurements that enable energy-aware performance optimizations of parallel codes.
Continuous learning of HPC infrastructure models using big data analytics and in-memory processing tools
Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges such as energy consumption, equipment-cooling, reliability…
The D.A.V.I.D.E. big-data-powered fine-grain power and performance monitoring support
A new methodology is presented, based on a set of HW and SW extensions for fine-grain monitoring of power and performance and for aggregating the measurements for fast analysis and visualization.
Toward a smart data transfer node
The architecture, methods, and algorithms needed for a smart data transfer node to support future scientific computing systems that self-tune and self-manage are explored.
Validation of Redfish: The Scalable Platform Management Standard
A Redfish Conformance Test Tool (RCTT) is designed that tests compliance and restores faith in the viability of Redfish at meeting customer expectations, validating Redfish's capability regarding the performance, scalability and security aspects defined in the Redfish Specification.
RAPL in Action
Experimental results suggest RAPL can be a very useful tool to measure and monitor the energy consumption of servers without deploying any complex power meters, and show that there are still some open issues, such as driver support, non-atomicity of register updates, and unpredictable timings, that might weaken the usability of RAPL in certain scenarios.
Design of an Energy Aware Petaflops Class High Performance Cluster Based on Power Architecture
An innovative and energy efficient High Performance Computing cluster designed by E4 Computer Engineering for PRACE, built using best-in-class components plus custom hardware and an innovative system middleware software.