- Zizhong Chen, Jack J. Dongarra
- SIAM J. Matrix Analysis Applications
- 2005

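Abstract. Let $G_{m \times n}$ be an $m \times n$ real random matrix whose elements are independent and identically distributed standard normal random variables, and let $\kappa_2(G_{m \times n})$ be the 2-norm condition number of $G_{m \times n}$. We prove that, for any $m \geq 2$, $n \geq 2$, and $x \geq |n - m| + 1$, $\kappa_2(G_{m \times n})$ satisfies

$$\frac{1}{\sqrt{2\pi}} \left(\frac{c}{x}\right)^{|n-m|+1} < P\left(\frac{\kappa_2(G_{m \times n})}{n/(|n-m|+1)} > x\right) < \frac{1}{\sqrt{2\pi}} \left(\frac{C}{x}\right)^{|n-m|+1},$$

where… (More)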

- Zizhong Chen
- PPOPP
- 2013

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline through the comparison of the final computation results of two duplicated computations, but… (More)

- Zizhong Chen, Jack J. Dongarra
- IEEE Transactions on Parallel and Distributed…
- 2008

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance that, for matrix-matrix… (More)

- Zizhong Chen, Graham E. Fagg, +4 authors Jack J. Dongarra
- PPOPP
- 2005

As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete… (More)

- Teresa Davies, Christer Karlsson, Hui Liu, Zizhong Chen
- ICS
- 2011

The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure.… (More)

- Graham E. Fagg, Edgar Gabriel, +4 authors Jack J. Dongarra
- IJHPCA
- 2005

With increasing numbers of processors on today's machines, the probability of node or link failures is also increasing. Therefore, application-level fault tolerance is becoming a more important issue for both end users and the institutions running the machines. This paper presents the semantics of a fault tolerant version of the Message Passing… (More)

- Zizhong Chen
- HPDC
- 2011

In today's high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied to a wide range of applications, it often introduces a considerable overhead especially when computations reach petascale and beyond. In this paper, we show that, for many iterative… (More)

- Julien Langou, Zizhong Chen, George Bosilca, Jack J. Dongarra
- SIAM J. Scientific Computing
- 2007

Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then, a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is… (More)

- Zizhong Chen, Jack J. Dongarra
- International Conference on Computational Science
- 2005

Error correction codes defined over real-number and complex-number fields have been studied and recognized as useful in many applications. However, most real-number and complex-number codes in the literature are quite suspect in their numerical stability. In this paper, we introduce a class of numerically stable real-number and complex-number codes that are… (More)

- Zizhong Chen, Graham E. Fagg, +4 authors Jack Dongarra
- 2005