Corpus ID: 186911

Fault tolerance for stream programs on parallel platforms

  • Vicent Sanz Marco
  • Published 2015
  • Computer Science
  • A distributed system is a collection of autonomous computers connected by a network and running distributed software that lets users see the system as a single entity providing computing facilities. Distributed systems with centralised control have a distinguished control node, called the leader node. The main role of the leader node is to distribute and manage shared resources in a resource-efficient manner. A distributed system with centralised control can use…


    Fault-tolerance in the borealis distributed stream processing system
    • 184
    Distributed system fault tolerance using message logging and checkpointing
    • 100
    Fault-Tolerant Parallel and Distributed Systems
    • 31
    Fault tolerance in distributed systems
    • 533
    Coordinated checkpoint versus message log for fault tolerant MPI
    • 54
    Distributed fault-tolerance for large multiprocessor systems
    • 195
    A survey of checkpointing algorithms for parallel and distributed computers
    • 66
    Finding a suitable checkpoint and recovery protocol for a distributed application
    • H. Paul, A. Gupta, Amit Sharma
    • Computer Science
    • J. Parallel Distributed Comput.
    • 2006
    • 5
    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
    • 107
    MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
    • 348