Paxos made live: an engineering perspective

@inproceedings{Chandra2007PaxosML,
  title={Paxos made live: an engineering perspective},
  author={Tushar Deepak Chandra and Robert Griesemer and Joshua Redstone},
  booktitle={PODC '07},
  year={2007}
}
We describe our experience in building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.

Paxos for System Builders: an overview
TLDR
An overview of Paxos for System Builders, a complete specification of the Paxos replication protocol written so that system builders can understand and implement it, with details of the safety and liveness properties the specification guarantees.
State based Paxos
TLDR
This paper evaluates the performance of State Paxos, a novel variation of the Paxos consensus algorithm that exploits overwrite semantics to eliminate most of the complexities and inefficiencies introduced by state management.
Paxos for System Builders
TLDR
This paper presents a complete specification of the Paxos replication protocol, written so that system builders can understand and implement it, and details the safety and liveness properties guaranteed by the specification.
Tutorial Summary: Paxos Explained from Scratch
TLDR
This tutorial aims to address the difficulty of Paxos by visualizing it in a completely new way, starting from a naive solution under strong assumptions and deriving the algorithm in a step-wise fashion.
CS 240H: Implementing Paxos in
TLDR
This project implements the basic Paxos algorithm in Haskell, whose power of expressiveness can be helpful to simplify the implementation.
Understanding Paxos and other distributed consensus algorithms
TLDR
This note provides a quick explanation of Paxos, a novel proof of correctness that is intended to provide insight into why the algorithm is as simple as the author has claimed, an explanation of why it does and why it doesn’t work, and has a brief discussion of alternatives.
Practical Experience Report: The Performance of Paxos in the Cloud
TLDR
This paper reports the results of an extensive performance evaluation of four open-source implementations of Paxos deployed in Amazon's EC2, finding that each implementation is optimized in a number of different ways, resulting in very different behavior.
The Performance of Paxos in the Cloud
TLDR
This paper reports the results of an extensive performance evaluation of four open-source implementations of Paxos deployed in Amazon's EC2, finding that each implementation is optimized in a number of different ways, resulting in very different behavior.
Yet Another Visit to Paxos
This paper presents a modular decomposition of crash-tolerant and Byzantine-tolerant protocols for reaching consensus that use the method introduced by the Paxos algorithm of Lamport and by the
Performance Engineering of a Lightweight Fault Tolerance Framework
TLDR
This thesis presents a lightweight consensus framework, the Paxos-Based Fault Tolerance (PFT) framework, and its practical implementation, including how the system tolerates faults under practical conditions where the replicas might not be strictly homogeneous due to the asynchrony of their deployment environment.

References

SHOWING 1-10 OF 19 REFERENCES
How to Build a Highly Available System Using Consensus
TLDR
The general scheme for efficient highly available computing is explained, a general method for understanding concurrent and fault-tolerant programs is given, and the Paxos algorithm is derived as an example of the method.
Implementing fault-tolerant services using the state machine approach: a tutorial
TLDR
The state machine approach is a general method for implementing fault-tolerant services in distributed systems; protocols for two different failure models, Byzantine and fail-stop, are described.
Revisiting the PAXOS algorithm
Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems
TLDR
This paper presents a new replication algorithm that has desirable performance properties, based on the primary copy technique, and uses a special kind of timestamp called a viewstamp to detect lost information.
Reaching agreement on processor-group membership in synchronous distributed systems
TLDR
Three simple protocols are proposed that provide all correct processors with consistent views of the processor-group membership and guarantee bounded processor failure detection and join delays.
Leases: an efficient fault-tolerant mechanism for distributed file cache consistency
TLDR
An analytic model and an evaluation for file access in the V system show that leases of short duration provide good performance and the impact of leases on performance grows more significant in systems of larger scale and higher processor performance.
Boxwood: Abstractions as the Foundation for Storage Infrastructure
TLDR
The authors have built a system called Boxwood to explore the feasibility and utility of providing high-level abstractions or data structures as the fundamental storage infrastructure, and have implemented an NFSv2 file service that demonstrates the promise of this approach.
Advances in ULTRA-Dependable Distributed Systems
TLDR
Fault tolerance concepts and hard real-time perspectives that apply jointly to ultra-dependable systems for critical control applications are explored.
The Chubby lock service for loosely-coupled distributed systems
TLDR
The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.
The part-time parliament
TLDR
The Paxon parliament's protocol provides a new way of implementing the state machine approach to the design of distributed systems.