To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between…
Cloud data centers host diverse applications, mixing workloads that require small predictable latency with others requiring large sustained throughput. In this environment, today's state-of-the-art TCP protocol falls short. We present measurements of a 6000 server production cluster and reveal impairments that lead to high application latencies, rooted in…
The data centers used to create cloud services represent a significant investment in capital outlay and ongoing costs. Accordingly, we first examine the costs of cloud service data centers today. The cost breakdown reveals the importance of optimizing work completed per dollar invested. Unfortunately, the resources inside the data centers often operate at…
Engineering a large IP backbone network without an accurate, network-wide view of the traffic demands is challenging. Shifts in user behavior, changes in routing policies, and failures of network elements can result in significant (and sudden) fluctuations in load. In this paper, we present a model of traffic demands to support traffic engineering and…
A matrix giving the traffic volumes between origin and destination in a network has tremendous potential utility for network capacity planning and management. Unfortunately, traffic matrices are generally unavailable in large operational IP networks. On the other hand, link load measurements are readily available in IP networks. In this paper, we propose…
As IP technologies providing both tremendous capacity and the ability to establish dynamic secure associations between endpoints emerge, Virtual Private Networks (VPNs) are going through dramatic growth. The number of endpoints per VPN is growing and the communication pattern between endpoints is becoming increasingly hard to forecast. Consequently, users…
Experience from an operational map-reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include (i) machine characteristics, both hardware reliability (e.g., disk failures) and run-time contention for processor, memory, and other resources, (ii) network characteristics with varying bandwidths and congestion…
We explore the nature of traffic in data centers, designed to support the mining of massive data sets. We instrument the servers to collect socket-level logs, with negligible performance impact. In a  server operational cluster, we thus amass roughly a petabyte of measurements over two months, from which we obtain and report detailed views of traffic and…
While today's data centers are multiplexed across many non-cooperating applications, they lack effective means to share their network. Relying on TCP's congestion control, as we show from experiments in production data centers, opens up the network to denial of service attacks and performance interference. We present Seawall, a network bandwidth…
Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an…