Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

@inproceedings{Guo2015PingmeshAL,
  title={Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis},
  author={Chuanxiong Guo and Lihua Yuan and Dong Xiang and Yingnong Dang and Ray Huang and David A. Maltz and Zhaoyi Liu and Vin Wang and Bin Pang and Hua Chen and Zhi Lin and Varugis Kurien},
  booktitle={Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication},
  year={2015}
}

Can we get network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling if an application perceived latency issue is caused by the network or not, defining and tracking network service level agreement (SLA), and automatic network troubleshooting. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer the above question… 

A First Look at Data Center Network Condition Through The Eyes of PTPmesh

TLDR
A better understanding is provided of how to exploit the measurement data offered by PTPmesh, and a detailed analysis of PTPmesh measurements collected in ten data centers from three cloud providers reveals different latency, latency variance and packet loss characteristics across data centers.

PTPmesh: Data Center Network Latency Measurements Using PTP

  • Diana Andreea Popescu, A. Moore
  • Computer Science
    2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
  • 2017
TLDR
The Precision Time Protocol is used to infer network latency and packet loss in data centers from different cloud providers, using PTPd, an open-source software implementation of PTP.

Low Latency Software Rate Limiters for Cloud Networks

TLDR
This paper analyzes the specific reasons why ECN marking in software rate limiters leads to the throughput oscillation problem, and proposes two potential solutions for designing software rate limiters that can achieve stable high throughput and low latency.

deTector: a Topology-aware Monitoring System for Data Center Networks

Troubleshooting network performance issues is a challenging task especially in large-scale data center networks. This paper presents deTector, a network monitoring system that is able to detect and

Characterizing the impact of network latency on cloud-based applications’ performance

TLDR
This work quantifies the effect of network latency on several typical cloud workloads, varying in complexity and use cases, and shows that different applications are affected by fixed and variable latency to differing degrees.

Freeway: An Order-less User-space Framework for Non-real-time Applications

  • Yifan Shen, Ke Liu, Mingyu Chen
  • Computer Science
    2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2020
TLDR
Freeway's performance is evaluated against the Linux TCP stack, and the results show that Freeway achieves 100% more bandwidth utilization than the Linux TCP stack and significantly reduces memory cost.

Evaluation of an InfiniBand Switch: Choose Latency or Bandwidth, but Not Both

TLDR
A performance measurement tool for RDMA-based networks, RPerf, is developed that is capable of precisely measuring the IB switch performance without hardware support, and finds that the evaluated switch can provide either low latency or high bandwidth, but not both simultaneously in a mixed-traffic scenario.

FSO clusters for data center network management and packet telemetry

  • A. Alghadhban
  • Computer Science, Business
    SIGCOMM Posters and Demos
  • 2020
TLDR
This poster uses free-space optical communications (FSO) to build a flexible yet high-performance overlay network for network management traffic (NMT) and shows that Fparcel achieves a throughput of 79% of the benchmark.

Centralized performance control for datacenter networks

TLDR
This work proposes that a centralized controller should tightly regulate senders' use of the network according to operator policy, and evaluates two architectures: Fastpass and Flowtune, both of which achieve high throughput comparable to current networks.

References

VL2: a scalable and flexible data center network

TLDR
VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.

Network traffic characteristics of data centers in the wild

TLDR
An empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers, which includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce style) applications.

Achieving high utilization with software-driven WAN

TLDR
A novel technique is developed that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any assumptions about the order and timing of updates at individual switches.

B4: experience with a globally-deployed software defined wan

TLDR
This work presents the design, implementation, and evaluation of B4, a private WAN connecting Google's data centers across the planet, using OpenFlow to control relatively simple switches built from merchant silicon.

Bullet trains: a study of NIC burst behavior at microsecond timescales

TLDR
The burst behavior of traffic emanating from a 10-Gbps end host across a variety of data center applications is studied, finding that at 10-100 microsecond timescales, the traffic exhibits large bursts (i.e., 10s of packets in length).

The nature of data center traffic: measurements & analysis

TLDR
The nature of traffic in data centers designed to support the mining of massive data sets is explored, and a petabyte of measurements collected over two months yields detailed views of traffic and congestion conditions and patterns.

Virtual network diagnosis as a service

TLDR
A case is made for providing virtual network diagnosis as a service in the cloud and a Virtual Network Diagnosis (VND) framework is proposed, which reduces the data collection and processing overhead by performing local flow capture and on-demand query execution.

Ananta: cloud scale load balancing

TLDR
The requirements of a cloud-scale load balancer, the design of Ananta and lessons learnt from its implementation and operation in the Windows Azure public cloud are described.

Windows Azure Storage: a highly available cloud storage service with strong consistency

TLDR
The WAS architecture, global namespace, and data model are described, as well as its resource provisioning, load balancing, and replication systems.

Autopilot: automatic data center management

TLDR
The first version of Autopilot, the automatic data center management infrastructure developed within Microsoft over the last few years, is described; it is responsible for automating software provisioning and deployment, system monitoring, and carrying out repair actions to deal with faulty software and hardware.