A GPU cluster is a cluster equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (<i>e.g.</i> matrix multiplication and LINPACK) and bandwidth-intensive tasks with data locality (<i>e.g.</i> finite-difference simulation). Bandwidth-intensive tasks such as large-scale FFTs without data locality are harder to …
In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. A peak performance of 393 Gflops is achieved …
Previously we have shown that the transient receptor potential vanilloid 4 (TRPV4) channel regulates urinary bladder function, and that TRPV4 is expressed in both smooth muscle and urothelial cell types within the bladder wall.(1) Urothelial cells have also been suggested to express TRPV1 channels.(2) Therefore, we enzymatically isolated guinea-pig …
(UNU). It is based in Macau, was founded in 1991, and started operations in July 1992. UNU/IIST is jointly funded by the Governor of Macau and the governments of the People's Republic of China and Portugal through a contribution to the UNU Endowment Fund. As well as providing two-thirds of the endowment fund, the Macau authorities also supply UNU/IIST …
An intermediate-level specification formalism (i.e., a specification language supported by laws and a semantic model), Logs, is presented for PRAM and BSP styles of parallel programming. It extends pre-post sequential semantics to reveal states at points of global synchronization. The result is an integration of the pre-post and reactive-process styles of …
In this paper we discuss our experiences in improving the performance of GEMM (both single and double precision) on the Fermi architecture using CUDA, and how new features of Fermi, such as cache, affect performance. We find that the addition of cache to the GPU on the one hand helps the processors take advantage of data locality that arises at runtime, but on …
To provide fault tolerance to computer systems suffering from transient faults, checkpointing and rollback recovery are widely used. Among other techniques, two primary checkpointing schemes have been proposed: independent and coordinated schemes. However, most existing work addresses only the need to employ a single checkpointing and rollback recovery …
This paper studies top-down program development techniques for Bulk-Synchronous Parallelism. In that context a specification formalism Logs, for 'the Logic of Global Synchrony', has been proposed for the specification and high-level development of BSP designs. This paper extends the use of Logs to provide support for the protection of local variables in BSP …
This paper studies parallel recursion. The trace specification language used in this paper incorporates sequentiality, nondeterminism, reactiveness (including infinite traces), three forms of parallelism (including conjunctive, fair-interleaving and synchronous parallelism) and general recursion. In order to use Tarski's theorem to determine the fixpoints …
Muscle represents an attractive target tissue for adeno-associated virus (AAV) vector-mediated gene transfer for hemophilia B (HB). Experience with direct intramuscular (i.m.) administration of AAV vectors in humans showed that the approach is safe but fails to achieve therapeutic efficacy. Here, we present a careful evaluation of the safety profile …