Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. Today's HPC applications typically tolerate fail-stop failures by checkpointing. However, checkpointing loses its efficiency as the system becomes very large. An alternative method is algorithm-based …
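Since the abstract is cut off, the sketch below only illustrates the checkpointing baseline it mentions: application state is saved periodically so a fail-stop failure loses at most one checkpoint interval of work. The file name, interval, and state layout are illustrative assumptions, not the paper's implementation.

    import os
    import pickle

    CKPT = "state.ckpt"              # hypothetical checkpoint file name

    def checkpoint(state):
        # save the full application state; a real HPC code would write to
        # parallel storage and coordinate the checkpoint across processes
        with open(CKPT, "wb") as f:
            pickle.dump(state, f)

    def restart():
        with open(CKPT, "rb") as f:
            return pickle.load(f)

    # resume from the last checkpoint if one exists, else start fresh
    if os.path.exists(CKPT):
        state = restart()
    else:
        state = {"iteration": 0, "data": [0.0] * 8}

    for i in range(state["iteration"], 100):
        state["data"] = [x + 1.0 for x in state["data"]]   # one step of "work"
        state["iteration"] = i + 1
        if (i + 1) % 10 == 0:                              # checkpoint interval
            checkpoint(state)

The checkpoint interval is the tuning knob the abstract alludes to: a short interval bounds the lost work but its I/O cost dominates as the machine grows.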
The knowledge grid needs to operate with a scalable platform to provide large-scale intelligent services. A key function of such a platform is to efficiently support various complex queries in a dynamic large-scale network environment. This paper proposes a platform to support index-based path queries by incorporating a semantic overlay with an underlying …
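As a rough illustration of what an index for path queries could look like, the sketch below registers each resource under every prefix of a slash-separated path and resolves queries against that prefix index. The class name, path syntax, and flat in-memory dictionary are assumptions for illustration, not the paper's overlay design.

    from collections import defaultdict

    class PathIndex:
        def __init__(self):
            self.index = defaultdict(set)   # path prefix -> set of node ids

        def publish(self, path, node_id):
            # register the node under every prefix of the path, e.g.
            # "knowledge/grid/query" -> "knowledge", "knowledge/grid", ...
            parts = path.split("/")
            for i in range(1, len(parts) + 1):
                self.index["/".join(parts[:i])].add(node_id)

        def query(self, path):
            return self.index.get(path, set())

    idx = PathIndex()
    idx.publish("knowledge/grid/query", "node-7")
    print(idx.query("knowledge/grid"))   # {'node-7'}

In a distributed setting this index would itself be partitioned across the overlay rather than held in one dictionary.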
The scalability problem ranks first among the dozen long-term information-technology research goals identified by Jim Gray [2]. Chip multiprocessors (CMPs), or multicores, are emerging as the dominant computing platform. In the multicore era, the scalability problem remains an interesting long-term goal, and it will become more urgent in the next …
The resource space model (RSM) is a semantic data model based on orthogonal classification semantics for efficiently managing various resources in the future interconnection environment. This paper extends the RSM in theory by formalizing the resource space, investigating its characteristics from the perspective of set theory, defining the resource space …
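To make the orthogonal-classification idea concrete, here is a toy resource space in which each axis is an independent classification and a resource is located by one coordinate per axis. The axis names and the dictionary-backed store are illustrative assumptions and stand in for the paper's set-theoretic formalization.

    class ResourceSpace:
        def __init__(self, axes):
            # axes: dict mapping axis name -> set of allowed coordinates
            self.axes = axes
            self.store = {}           # coordinate tuple -> set of resources

        def put(self, coords, resource):
            # coords gives one coordinate per axis; each must lie on its axis
            for axis, value in coords.items():
                if value not in self.axes[axis]:
                    raise ValueError(f"{value} is not on axis {axis}")
            key = tuple(coords[a] for a in self.axes)
            self.store.setdefault(key, set()).add(resource)

        def get(self, coords):
            key = tuple(coords[a] for a in self.axes)
            return self.store.get(key, set())

    rs = ResourceSpace({"category": {"paper", "data"},
                        "area": {"grid", "p2p"},
                        "year": {2006, 2007}})
    rs.put({"category": "paper", "area": "grid", "year": 2006}, "rsm-extension")
    print(rs.get({"category": "paper", "area": "grid", "year": 2006}))

Because the axes are orthogonal, a point in the space is fully determined by choosing one coordinate on each axis independently.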
This paper presents Harmonic Ring (HRing), a structured peer-to-peer (P2P) overlay where long links are built along the ring with decreasing probabilities coinciding with the harmonic series. HRing constructs routing tables based on the distance between node positions instead of node IDs in order to eliminate the effect of node ID distribution on the long …
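The harmonic long-link rule can be sketched as follows: the probability of creating a long link to a position at ring distance d is proportional to 1/d, so nearby positions are favored while very distant positions remain reachable. The ring size and the sampling routine below are assumptions for illustration, not HRing's actual construction procedure.

    import random

    def harmonic_long_link(ring_size):
        # P(distance = d) proportional to 1/d, for d = 1 .. ring_size - 1
        distances = list(range(1, ring_size))
        weights = [1.0 / d for d in distances]
        return random.choices(distances, weights=weights, k=1)[0]

    ring_size = 1024
    samples = [harmonic_long_link(ring_size) for _ in range(5)]
    print(samples)   # mostly short distances, occasionally very long ones

Sampling over position distance rather than ID distance is what makes the link distribution independent of how node IDs happen to be assigned.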
With the growing scale of high-performance computing (HPC) systems, today and more so tomorrow, faults are the norm rather than the exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, where even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a …
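A back-of-the-envelope sketch of why stop-and-wait becomes costly at scale: if one of P processors fails, all P processors idle for the recovery time, so the lost work grows linearly with P. The recovery time and processor counts below are purely illustrative.

    def wasted_work(num_procs, recovery_time):
        # every processor stops and waits, so the loss is P * t_recover
        return num_procs * recovery_time

    for p in (1_000, 10_000, 100_000):
        print(p, wasted_work(p, recovery_time=60.0), "processor-seconds lost")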
Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another promising method works at the algorithm level and is called algorithmic recovery. These two methods can achieve high efficiency when …
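To illustrate the algorithm-level idea, the sketch below keeps one extra checksum block alongside the data blocks held by several processors, so a single lost block can be rebuilt from the survivors without rolling back to a checkpoint. The encoding shown is a generic sum checksum, an assumption for illustration rather than the paper's specific scheme.

    def encode(blocks):
        # append a checksum block: the element-wise sum of all data blocks
        checksum = [sum(col) for col in zip(*blocks)]
        return blocks + [checksum]

    def recover(encoded, lost):
        # rebuild the lost block as checksum minus the surviving data blocks
        survivors = [b for i, b in enumerate(encoded[:-1]) if i != lost]
        checksum = encoded[-1]
        return [c - sum(col) for c, col in zip(checksum, zip(*survivors))]

    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three "processors"
    enc = encode(data)
    print(recover(enc, lost=1))   # [3.0, 4.0]

The appeal of algorithmic recovery is that the redundancy is maintained by the computation itself, so the recovery cost does not grow with the frequency of periodic checkpoints.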