Let $G_{m \times n}$ be an $m \times n$ real random matrix whose elements are independent and identically distributed standard normal random variables, and let $\kappa_2(G_{m \times n})$ be the 2-norm condition number of $G_{m \times n}$. We prove that, for any $m \ge 2$, $n \ge 2$, and $x \ge |n - m| + 1$, $\kappa_2(G_{m \times n})$ satisfies $\frac{1}{\sqrt{2\pi}} (c/x)^{|n-m|+1} < P\left(\frac{\kappa_2(G_{m \times n})}{n/(|n-m|+1)} > x\right) < \frac{1}{\sqrt{2\pi}} (C/x)^{|n-m|+1}$,… (More)
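The tail probability in this abstract can be explored numerically. The following is a minimal sketch, not the paper's method: it assumes the smallest admissible case m = n = 2, where the exponent |n − m| + 1 equals 1 and the scaling factor n/(|n − m| + 1) equals 2, and it uses the closed-form singular values of a 2 × 2 matrix. The function names `cond2_2x2` and `tail_probability`, the trial count, and the seed are illustrative choices. The constants c and C are not given in the truncated abstract, so the sketch only estimates the probability itself.

```python
import math
import random


def cond2_2x2(a, b, c, d):
    """2-norm condition number of [[a, b], [c, d]] from closed-form singular values."""
    frob2 = a * a + b * b + c * c + d * d            # sigma_max^2 + sigma_min^2
    det = abs(a * d - b * c)                         # sigma_max * sigma_min
    s_sum = math.sqrt(frob2 + 2.0 * det)             # sigma_max + sigma_min
    s_diff = math.sqrt(max(frob2 - 2.0 * det, 0.0))  # sigma_max - sigma_min
    s_min = 0.5 * (s_sum - s_diff)
    s_max = 0.5 * (s_sum + s_diff)
    return math.inf if s_min == 0.0 else s_max / s_min


def tail_probability(x, trials=20000, seed=0):
    """Monte Carlo estimate of P(kappa_2(G) / (n / (|n - m| + 1)) > x) for m = n = 2."""
    rng = random.Random(seed)
    scale = 2.0  # n / (|n - m| + 1) with m = n = 2 (assumed case)
    hits = 0
    for _ in range(trials):
        a, b, c, d = (rng.gauss(0.0, 1.0) for _ in range(4))
        if cond2_2x2(a, b, c, d) / scale > x:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # x must satisfy x >= |n - m| + 1 = 1 in this case.
    for x in (1.0, 2.0, 4.0, 8.0):
        print(x, tail_probability(x))
```

For the square case the estimates should fall off roughly in proportion to 1/x, which is the decay rate the stated bounds imply when |n − m| + 1 = 1.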

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline through the comparison of the final computation results of two duplicated computations, but… (More)
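The duplication-based detection mentioned in this abstract can be sketched as follows. This is an illustrative toy, not the authors' scheme: the dot-product workload, the `corrupt_at` injection parameter, and the tolerance are all assumptions made for the example.

```python
def dot(xs, ys, corrupt_at=None):
    """Dot product; if corrupt_at is set, one partial product is perturbed
    to mimic a soft error (hypothetical fault injection for this sketch)."""
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        term = x * y
        if i == corrupt_at:
            term += 1.0  # silently corrupted value; the run still completes
        total += term
    return total


def results_agree(r1, r2, tol=1e-9):
    """Compare the final results of two duplicated computations."""
    return abs(r1 - r2) <= tol * max(1.0, abs(r1), abs(r2))


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [0.5, 0.25, 2.0, 1.0]
    clean = dot(xs, ys)
    print(results_agree(clean, dot(xs, ys)))                # both runs clean
    print(results_agree(clean, dot(xs, ys, corrupt_at=2)))  # one run corrupted
```

A mismatch reveals that an error occurred but not which run is wrong or where the corruption happened, and the check costs a second full computation.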

The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long-running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure.… (More)

Modeling and analysis of large scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this often will be performed in high performance clusters containing many processors. Assuming a constant failure rate per processor, the… (More)
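The effect of a constant per-processor failure rate can be illustrated with a small calculation. This is a sketch under assumptions the abstract does not state: failures are taken to be independent across processors with exponentially distributed times to failure, and the function names and sample numbers are hypothetical.

```python
import math


def system_mttf(per_processor_mttf_hours, num_processors):
    """Mean time to the first failure anywhere in the system.

    With independent exponential failures, the rates add, so the
    system-level mean time to failure shrinks in proportion to 1/N.
    """
    return per_processor_mttf_hours / num_processors


def prob_failure_before(run_hours, per_processor_mttf_hours, num_processors):
    """Probability that at least one processor fails before the run ends."""
    system_rate = num_processors / per_processor_mttf_hours
    return -math.expm1(-system_rate * run_hours)  # 1 - exp(-rate * T)


if __name__ == "__main__":
    # Hypothetical inputs: a 24-hour run, 50,000-hour MTTF per processor.
    for n in (1, 100, 10_000):
        print(n, system_mttf(50_000, n), prob_failure_before(24, 50_000, n))
```

Under this model the chance of losing a run grows quickly with processor count even when each individual processor is very reliable.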

With increasing numbers of processors on today's machines, the probability of node or link failures is also increasing. Therefore, application-level fault tolerance is becoming a more important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault tolerant version of the Message Passing… (More)

When more processors are used for a calculation, the probability that one will fail during the calculation increases. Fault tolerance is a technique for allowing a calculation to survive a failure, and includes recovering lost data. A common method of recovery is diskless checkpointing. However, it has high overhead when a large amount of data is involved,… (More)
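Diskless checkpointing is commonly described as keeping an encoding of every process's state in the memory of an extra process instead of on disk. The sketch below assumes a simple element-wise sum encoding and a single failure; the function names and the in-process simulation of "ranks" as list entries are illustrative and are not taken from the paper.

```python
def encode_checkpoint(local_states):
    """Element-wise sum of all processes' state vectors.

    In a real system this would live in the memory of a dedicated
    checkpoint process; here it is just a list.
    """
    return [sum(column) for column in zip(*local_states)]


def recover(local_states, checksum, failed_rank):
    """Rebuild the failed process's state from the checksum and the survivors."""
    survivors = [s for rank, s in enumerate(local_states) if rank != failed_rank]
    if not survivors:
        return list(checksum)  # only one process existed, so the sum is its state
    return [c - sum(column) for c, column in zip(checksum, zip(*survivors))]


if __name__ == "__main__":
    states = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # one state vector per simulated rank
    checksum = encode_checkpoint(states)
    print(recover(states, checksum, failed_rank=1))
```

Every checkpoint has to encode the full state of every process, which is where the overhead mentioned above comes from when the data is large. With floating-point data the recovered values may also differ from the originals by rounding error.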

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in the previous algorithm-based fault tolerance research that, for… (More)
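The classic checksum idea from algorithm-based fault tolerance for matrix multiplication can be sketched on a single machine. This is the textbook encoding (a checksum row appended to A, a checksum column appended to B), shown under the assumption of small integer matrices so that equality checks are exact; it is not the ScaLAPACK-based algorithm of the paper, and all function names are illustrative.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def checksums_hold(c):
    """Check that the last row and last column of c are still valid checksums."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(c[-1][j] == sum(row[j] for row in c[:-1]) for j in range(len(c[0])))
    return rows_ok and cols_ok


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    print(checksums_hold(c_full))            # product of encoded inputs is encoded
    print([row[:-1] for row in c_full[:-1]])  # the ordinary product a * b
    c_full[0][0] += 1                         # simulate a corrupted entry
    print(checksums_hold(c_full))
```

Because the product of the encoded inputs carries its own checksums, a lost or corrupted entry can be detected, and a single one rebuilt, from the remaining entries. With floating-point data the equality tests would need a tolerance.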

As the desire of scientists to perform ever larger computations drives the size of today's high performance computers from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often… (More)

- Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca +1 other
- PPOPP
- 2005

As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete… (More)

This article describes the context, design, and recent development of the LAPACK for clusters (LFC) project. It has been developed in the framework of Self-Adapting Numerical Software (SANS) since we believe such an approach can deliver the convenience and ease of use of existing sequential environments bundled with the power and versatility of highly tuned… (More)