Juan Fernández Peinador

The Cell Broadband Engine (Cell BE) processor has the potential to deliver an impressive level of performance for scientific applications. This level of performance can be reached by exploiting several dimensions of parallelism, such as thread-level parallelism across several Synergistic Processing Elements, data streaming parallelism, vector parallelism in…
In the near future, large-scale parallel computers will feature hundreds of thousands of processing nodes. In such systems, fault tolerance is critical because failures will occur very often. Checkpointing and rollback recovery have been studied extensively as a means of providing fault tolerance. However, current implementations do not provide the total…
Fine-grained parallel applications require all their processes to run simultaneously on distinct processors to achieve good efficiency. This is typically achieved by space slicing with variable partitioning, wherein nodes are dedicated for the duration of the run, or by gang scheduling, wherein time slicing is coordinated across processors. Both schemes…
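Gang scheduling is often pictured as an Ousterhout-style matrix: rows are time slots, columns are processors, and every process of a job sits in the same row, so the whole job runs during the same time quantum. The sketch below is a minimal illustration under stated assumptions, not the method of the paper above: the function names `build_matrix` and `schedule` are hypothetical, placement is simple first-fit, and a job's processors need not be contiguous.

```python
def build_matrix(jobs, num_procs):
    """Place each (name, size) job in the first time slot with enough idle processors.

    Returns a list of rows (time slots); each row has num_procs entries holding
    a job name, or None for an idle processor. Assumption: first-fit placement,
    non-contiguous processors allowed.
    """
    rows = []
    for name, size in jobs:
        if size > num_procs:
            raise ValueError(f"job {name} needs {size} > {num_procs} processors")
        for row in rows:
            free = [i for i, slot in enumerate(row) if slot is None]
            if len(free) >= size:
                # All processes of the job go into this single time slot.
                for i in free[:size]:
                    row[i] = name
                break
        else:
            # No existing slot fits: open a new one, raising the multiprogramming level.
            row = [None] * num_procs
            for i in range(size):
                row[i] = name
            rows.append(row)
    return rows


def schedule(rows, quanta):
    """Cycle through the time slots round-robin, one slot per time quantum."""
    for q in range(quanta):
        yield rows[q % len(rows)]


if __name__ == "__main__":
    matrix = build_matrix([("A", 4), ("B", 2), ("C", 2), ("D", 3)], num_procs=4)
    for q, active in enumerate(schedule(matrix, 4)):
        print(q, active)
```

With four processors, jobs B and C share one time slot while A and D each get their own, so the matrix has three rows and the multiprogramming level is three. Real gang schedulers add concerns this sketch ignores, such as fragmentation, alternative scheduling, and the cost of context switches at each quantum.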
The Cell Broadband Engine (Cell BE) is a recent heterogeneous chip-multiprocessor (CMP) architecture jointly developed by IBM, Sony and Toshiba to offer very high performance, especially on game and multimedia applications. The significant number of processor cores that it contains (nine in its first generation), along with their heterogeneity (they are of…
Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem, essentially all of the code that runs on a cluster other than the…
Many scientific and high-performance computing applications consist of multiple processes that run on different processors and communicate frequently. Because of their synchronization needs, these applications can suffer severe performance penalties if their processes are not all coscheduled to run together. Two common approaches to coscheduling jobs are batch…
Many parallel algorithms require efficient reduction collectives. In response, researchers have designed algorithms considering a range of parameters including data size, system size, and communication characteristics. Throughout this past work, however, processing was limited to the host CPU. Today, modern Network Interface Cards (NICs) sport programmable…
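A binomial tree is one of the classic schedules for a reduction collective, and it makes the "system size" parameter concrete: p ranks finish in ceil(log2 p) rounds using p - 1 messages. The sketch below simulates that schedule on a single host; it is an illustrative assumption, not the NIC-based algorithm of the paper above, and the function name `binomial_tree_reduce` is hypothetical.

```python
import operator
from typing import Callable, List, Tuple


def binomial_tree_reduce(
    values: List[int], op: Callable[[int, int], int] = operator.add
) -> Tuple[int, int, int]:
    """Simulate a binomial-tree reduction of values[r] (rank r's data) to rank 0.

    In the round with distance d = 1, 2, 4, ..., every rank r with r % (2*d) == d
    sends its partial result to rank r - d, which combines it with op. Senders
    and receivers in a round are disjoint, so the updates within a round do not
    interfere. op is assumed associative; operand order follows rank order.
    Returns (result, rounds, messages).
    """
    p = len(values)
    partial = list(values)
    rounds = 0
    messages = 0
    d = 1
    while d < p:
        for r in range(d, p, 2 * d):  # senders in this round
            partial[r - d] = op(partial[r - d], partial[r])
            messages += 1
        rounds += 1
        d *= 2
    return partial[0], rounds, messages


if __name__ == "__main__":
    print(binomial_tree_reduce(list(range(1, 9))))
    print(binomial_tree_reduce([3, 9, 2], max))
```

For eight ranks holding 1 through 8, the result is 36 after three rounds and seven messages. Production libraries switch among tree, ring, and recursive-halving variants depending on message size, which is exactly the kind of parameter-driven choice the abstract mentions.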
Jobs that run on parallel systems that use gang scheduling for multiprogramming may interact with each other in various ways. These interactions are affected by system parameters such as the level of multiprogramming and the scheduling time quantum. A careful evaluation is therefore required to find parameter values that lead to optimal…
Multi-core processors represent a paradigm shift in computer architecture that promises a dramatic increase in performance. But multi-core processors also bring an unprecedented level of complexity to algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a breadth-first search (BFS)…
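A common starting point for parallelizing BFS is the level-synchronous formulation: the whole current frontier is expanded before any vertex of the next level, so each level is a batch of work that can be split across cores with a barrier between levels. The sequential sketch below shows only that structure; it is an assumption for illustration, not necessarily the design described in the paper, and the function name `level_synchronous_bfs` is hypothetical.

```python
from typing import Dict, List


def level_synchronous_bfs(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Return the BFS level (hop distance) of every vertex reachable from source.

    adj maps each vertex to its list of neighbors. Vertices absent from the
    result are unreachable.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # A parallel version would partition this loop over the frontier among
        # cores; the check-and-mark on `level` would then need to be atomic.
        for u in frontier:
            for v in adj.get(u, []):
                if v not in level:  # first visit assigns the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier  # the level boundary acts as a barrier
    return level


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
    print(level_synchronous_bfs(graph, 0))
```

On the small graph above, vertices 1 and 2 land on level 1, vertices 3 and 4 on level 2, and vertex 5 on level 3. The irregular, data-dependent memory accesses in the inner loop are what make BFS hard to map efficiently onto multi-core hardware.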
One multicore platform currently attracting enormous attention, owing to its tremendous potential in terms of sustained performance, is the NVIDIA Tesla family of boards. These cards, intended for general-purpose computing on graphics processing units (GPGPU), are used as data-parallel computing devices. They are based on the Compute Unified Device…