On the Design of a Failure Detection Service for Large-Scale Distributed Systems

Abstract

It is widely recognized that distributed systems would greatly benefit from the availability of a generic failure detection service. There are, however, several issues that must be addressed before such a service can actually be implemented. In this paper, we highlight the main issues related to ensuring failure detection in large-scale systems and survey the main solutions proposed in the literature so far. We then outline a pragmatic architecture for a failure detection service based on the φ-failure detector and a combination of techniques proposed in related work.

Cite this paper

@inproceedings{Dfago2003OnTD,
  title={On the Design of a Failure Detection Service for Large-Scale Distributed Systems},
  author={Xavier D{\'e}fago and Naohiro Hayashibara and Takuya Katayama},
  year={2003}
}