- Zizhong Chen
- PPOPP
- 2013

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline through the comparison of the final computation results of two duplicated computations, but…

- Zizhong Chen, Jack J. Dongarra
- SIAM J. Matrix Analysis Applications
- 2005

Let $G_{m \times n}$ be an $m \times n$ real random matrix whose elements are independent and identically distributed standard normal random variables, and let $\kappa_2(G_{m \times n})$ be the 2-norm condition number of $G_{m \times n}$. We prove that, for any $m \ge 2$, $n \ge 2$, and $x \ge |n - m| + 1$, $\kappa_2(G_{m \times n})$ satisfies $\frac{1}{\sqrt{2\pi}} (c/x)^{|n-m|+1} < P\left( \frac{\kappa_2(G_{m \times n})}{n/(|n-m|+1)} > x \right) < \frac{1}{\sqrt{2\pi}} (C/x)^{|n-m|+1}$,…
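
The tail probability in the middle of this inequality can be estimated numerically. The following is a minimal Monte Carlo sketch, not the paper's method, restricted to the square case m = n = 2, where the exponent |n − m| + 1 equals 1 and the normalization n/(|n − m| + 1) equals 2. The function names `cond2_2x2` and `tail_probability`, the sample count, and the seed are illustrative assumptions. The constants c and C are not given in the excerpt, so the code estimates only the probability itself.

```python
import math
import random


def cond2_2x2(a, b, c, d):
    """2-norm condition number of [[a, b], [c, d]] from closed-form singular values."""
    frob_sq = a * a + b * b + c * c + d * d   # equals s1^2 + s2^2
    det = abs(a * d - b * c)                  # equals s1 * s2
    if det == 0.0:
        return math.inf                       # singular matrix
    # s1^2 - s2^2, clamped at zero to absorb rounding error
    gap = math.sqrt(max(0.0, frob_sq * frob_sq - 4.0 * det * det))
    return (frob_sq + gap) / (2.0 * det)      # s1^2 / (s1 * s2) = s1 / s2


def tail_probability(x, samples=20000, seed=0):
    """Estimate P(kappa_2(G) / 2 > x) for a 2 x 2 standard normal matrix G."""
    rng = random.Random(seed)
    scale = 2.0                               # n / (|n - m| + 1) with m = n = 2
    hits = 0
    for _ in range(samples):
        entries = [rng.gauss(0.0, 1.0) for _ in range(4)]
        if cond2_2x2(*entries) / scale > x:
            hits += 1
    return hits / samples


if __name__ == "__main__":
    for x in (2, 4, 8, 16):
        p = tail_probability(x)
        # With exponent 1, the bound predicts that x * P stays within a constant band.
        print(x, p, x * p)
```

If the printed product `x * p` stays roughly level as `x` grows, that is consistent with the 1/x decay the bound predicts for this case.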

- Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, Zizhong Chen
- ICS
- 2011

The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure.…

- Zizhong Chen
- Proceedings of the Conference on High Performance…
- 2009

It has been demonstrated recently that single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing. Multiple simultaneous processor failures can be tolerated without checkpointing by encoding matrices using a real-number erasure correcting code. However, the floating-point representation of a real number in…
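
The idea of tolerating a single failure through encoding can be illustrated with a plain sum checksum. The sketch below is a simplified, sequential stand-in and not the paper's code or ScaLAPACK's API: it appends a checksum row to A, multiplies by B, and rebuilds one lost row of the product from the checksum. The helper names `matmul`, `add_checksum_row`, and `recover_row` are hypothetical, and integer entries are used so the recovery is exact.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_checksum_row(a):
    """Return a copy of `a` with one extra row holding the column sums."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]


def recover_row(c_encoded, lost):
    """Rebuild row `lost` of the product from the checksum row (the last row).

    Rows of (A with checksum) * B satisfy: checksum row = sum of data rows,
    so a single missing data row is the checksum minus the surviving rows.
    """
    checksum = c_encoded[-1]
    survivors = [row for i, row in enumerate(c_encoded[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    c_encoded = matmul(add_checksum_row(a), b)
    original = c_encoded[1]
    c_encoded[1] = None                      # simulate losing one row of the result
    print(recover_row(c_encoded, 1) == original)
```

A single checksum row can rebuild only one lost row. Tolerating several simultaneous losses needs several independent weighted checksums, which is where the choice of real-number code and its floating-point behavior becomes important.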

- Douglas Hakkarinen, Zizhong Chen
- 2010 IEEE International Symposium on Parallel…
- 2010

Modeling and analysis of large scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. With large matrices, this often will be performed in high performance clusters containing many processors. Assuming a constant failure rate per processor, the…
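
As a small illustration of the first sentence, the sketch below solves a least squares problem by forming the normal equations XᵀXβ = Xᵀy and factoring XᵀX = LLᵀ with Cholesky. It is a sequential, pure-Python sketch with no fault tolerance, not the paper's parallel algorithm. The function names are illustrative, and X is assumed to have full column rank.

```python
import math


def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(s)
            else:
                l[i][j] = s / l[j][j]
    return l


def solve_cholesky(l, b):
    """Solve (L * L^T) x = b by forward then backward substitution."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):                        # forward: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):              # backward: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


def least_squares(x_rows, y):
    """Least squares coefficients via the normal equations X^T X beta = X^T y."""
    cols = list(zip(*x_rows))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve_cholesky(cholesky(xtx), xty)


if __name__ == "__main__":
    # Fit y = b0 + b1 * t to points that lie exactly on y = 1 + 2 t.
    design = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
    print(least_squares(design, [1.0, 3.0, 5.0, 7.0]))
```

Forming XᵀX squares the condition number of X, so this route suits well-conditioned problems. A QR-based solve is the usual alternative when conditioning is a concern.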

- Zizhong Chen, Jack J. Dongarra
- International Conference on Computational Science
- 2005

Error correction codes defined over real-number and complex-number fields have been studied and recognized as useful in many applications. However, most real-number and complex-number codes in literature are quite suspect in their numerical stability. In this paper, we introduce a class of numerically stable real-number and complex-number codes that are…

- Zizhong Chen, Graham E. Fagg, +4 authors Jack J. Dongarra
- PPOPP
- 2005

As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete…

- Zizhong Chen, Graham E. Fagg, +4 authors Jack Dongarra
- 2005

- Zizhong Chen, Jack J. Dongarra, Piotr Luszczek, Kenneth Roche
- HICSS
- 2004

This article describes the context, design, and recent development of the LAPACK for Clusters (LFC) project. It has been developed in the framework of Self-Adapting Numerical Software (SANS) since we believe such an approach can deliver the convenience and ease of use of existing sequential environments bundled with the power and versatility of highly-tuned…

- Li Tan, Zizhong Chen
- ICPE
- 2015

The presence of pervasive slack provides ample opportunities for achieving energy efficiency for HPC systems nowadays. Regardless of communication slack, classic energy saving approaches for saving energy during the slack otherwise include race-to-halt and CP-aware slack reclamation, which rely on power scaling techniques to adjust processor power states…