• Corpus ID: 14142751

A Fault-Tolerant Engineered Network

@inproceedings{Liu2013AFE,
  title={A Fault-Tolerant Engineered Network},
  author={Vincent Liu and Daniel Halperin and Arvind Krishnamurthy and Thomas E. Anderson},
  year={2013}
}
The data center network is increasingly a cost, reliability and performance bottleneck for cloud computing. Although multi-tree topologies can provide scalable bandwidth and traditional routing algorithms can provide eventual fault tolerance, we argue that recovery speed can be dramatically improved through the co-design of the network topology, routing algorithm and failure detector. We create an engineered network and routing protocol that directly address the failure characteristics observed… 
WCMP: weighted cost multipathing for improved fairness in data centers
TLDR
This work presents a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology and shows that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP.
Expander Datacenters: From Theory to Practice
TLDR
This paper examines if expanders can be effective for the technology and environments practical in today's data centers, including the use of traditional protocols, at both small and large scale while complying with common practices such as over-subscription.
A Large Scale Study of Data Center Network Reliability
TLDR
This paper presents a large scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world.
Micro Load Balancing in Data Centers with DRILL
TLDR
This work provides a very simple in-network load balancing scheduling algorithm called DRILL which is purely local to each switch and outperforms CONGA, a recent global edge-based load balancing scheme for data centers.
Xpander: Towards Optimal-Performance Datacenters
TLDR
It is shown that the benefits of state-of-the-art proposals are derived from the fact that they are (implicitly) utilizing "expander graphs" (aka expanders) as their network topologies, thus unveiling a unifying theme of these proposals.
Minimal Rewiring: Efficient Live Expansion for Clos Data Center Networks
TLDR
This work demonstrates that it is indeed possible to design expandable Clos DCNs, and to expand them while they are carrying live traffic, without incurring packet loss, and describes how to use integer linear programming (ILP) to minimize the number of patch-panel connections that must be changed, which makes expansions faster and cheaper.
Verifying distributed system implementations
  • Computer Science
  • 2015
TLDR
This work aims to close the formality gap and convert the practice of building distributed systems to a verification-based approach that eases the burden for programmers to implement correct, high-performance, and maintainable distributed systems.
Network configuration synthesis with abstract topologies
We develop Propane/AT, a system to synthesize provably-correct BGP (border gateway protocol) configurations for large, evolving networks from high-level specifications of topology, routing policy,
The Deforestation of L2
TLDR
This paper examines an alternate point in the L2 design space, which is simple, converges quickly, delivers packets during convergence, utilizes all available links, and can be extended to support both equal-cost multipath and efficient multicast.
Enabling Wide-Spread Communications on Optical Fabric with MegaSwitch
TLDR
MegaSwitch is presented, a multi-fiber ring optical fabric that exploits space division multiplexing across multiple fibers to deliver rearrangeably non-blocking communications to 30+ racks and 6000+ servers.

References

F10: A Fault-Tolerant Engineered Network
TLDR
This work creates an engineered network and routing protocol that can almost instantaneously reestablish connectivity and load balance, even in the presence of multiple failures, and shows that following network link and switch failures, F10 has less than 1/7th the packet loss of current schemes.
PortLand: a scalable fault-tolerant layer 2 data center network fabric
TLDR
Through the design and implementation of PortLand, a scalable, fault tolerant layer 2 routing and forwarding protocol for data center environments, it is shown that PortLand holds promise for supporting a "plug-and-play" large-scale data center network.
DCell: a scalable and fault-tolerant network structure for data centers
TLDR
Results from theoretical analysis, simulations, and experiments show that DCell is a viable interconnection structure for data centers, that it can be incrementally expanded, and that a partial DCell provides the same appealing features.
A scalable, commodity data center network architecture
TLDR
This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Achieving convergence-free routing using failure-carrying packets
TLDR
This work proposes Failure-Carrying Packets (FCP), a technique that allows data packets to autonomously discover a working path without requiring completely up-to-date state in routers, and shows that it provides better routing guarantees under failures despite maintaining less state at the routers.
Improving datacenter performance and robustness with multipath TCP
TLDR
This work proposes using Multipath TCP as a replacement for TCP in large-scale data centers, as it can effectively and seamlessly use available bandwidth, giving improved throughput and better fairness on many topologies.
VL2: a scalable and flexible data center network
TLDR
VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.
Understanding network failures in data centers: measurement, analysis, and implications
TLDR
The first large-scale analysis of failures in a data center network is presented, finding that data center networks show high reliability, commodity switches such as ToRs and AggS are highly reliable, and network redundancy is only 40% effective in reducing the median impact of failure.
Hedera: Dynamic Flow Scheduling for Data Center Networks
TLDR
Hedera is presented, a scalable, dynamic flow scheduling system that adaptively schedules a multi-stage switching fabric to efficiently utilize aggregate network resources and delivers bisection bandwidth that is 96% of optimal and up to 113% better than static load-balancing methods.
The Extra Stage Cube: A Fault-Tolerant Interconnection Network for Supersystems
TLDR
It is shown that the ESC provides fault tolerance for any single failure, and the network can be controlled even when it has a failure, using a simple modification of a routing tag scheme proposed for the Generalized Cube.