A hierarchical framework for cross-domain MapReduce execution

Yuan Luo, Zhenhua Guo, Yiming Sun, Beth Plale, Judy Qiu, and Wilfred W. Li. ECMLS '11.
The MapReduce programming model provides an easy way to execute pleasingly parallel applications. Many data-intensive life science applications fit this programming model and benefit from its scalability. One such application is AutoDock, which consists of a suite of automated tools for predicting the bound conformations of flexible ligands to macromolecular targets. However, researchers also need sufficient computation and storage resources to fully enjoy…

Hierarchical MapReduce: towards simplified cross-domain data processing

A hierarchical MapReduce framework is presented that utilizes computation resources from multiple clusters simultaneously to run MapReduce jobs across them; it adopts the Map-Reduce-GlobalReduce model, in which computations are expressed as three functions: Map, Reduce, and GlobalReduce.
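The three-function model described above can be sketched in a few lines. This is an illustrative word-count example, not the paper's implementation: each "cluster" runs a local Map and Reduce over its own data, and the partial results are combined by a GlobalReduce at the controller.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sketch of the Map-Reduce-GlobalReduce model (word count).
# Function names and structure are illustrative, not the authors' code.

def map_fn(line):
    return [(w, 1) for w in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

def global_reduce_fn(key, values):
    return (key, sum(values))

def run_local_mapreduce(records):
    """Run Map and Reduce within one cluster, yielding partial results."""
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

def run_hierarchical(clusters):
    """Gather per-cluster partial results, then apply GlobalReduce."""
    partials = chain.from_iterable(run_local_mapreduce(c) for c in clusters)
    groups = defaultdict(list)
    for k, v in partials:
        groups[k].append(v)
    return dict(global_reduce_fn(k, vs) for k, vs in groups.items())

clusters = [["a b a"], ["b b c"]]
print(run_hierarchical(clusters))  # {'a': 2, 'b': 3, 'c': 1}
```

The key property is that only the small Reduce outputs, not the raw input data, cross cluster boundaries before GlobalReduce runs.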

MapReduce and Data Intensive Applications

This paper reviews state-of-the-art MapReduce systems for scientific applications and summarizes research issues found in prior studies by analyzing their usage across different MapReduce platforms in HPC and cloud environments.

Optimizing MapReduce for Highly Distributed Environments

This paper develops a modeling framework to capture MapReduce execution in a highly distributed environment comprising distributed data sources and distributed computational resources and proposes a model-driven optimization that is end-to-end as opposed to myopic optimizations that may only make locally optimal but globally suboptimal decisions.

End-to-End Optimization for Geo-Distributed MapReduce

This paper develops a model-driven optimization that serves as an oracle, providing high-level insights, and applies these insights to design cross-phase optimization techniques that are implemented and demonstrated in a real-world MapReduce implementation.

Federated MapReduce to Transparently Run Applications on Multicluster Environment

Federated MapReduce (Fed-MR) is proposed, a framework aimed at analyzing geographically distributed data held by independent organizations while avoiding data movement; it shows reasonable performance overheads when analyzing data across Internet-connected clusters and, unlike traditional hierarchical MapReduce frameworks, requires no additional GlobalReduce function.

MapReduce Join Across Geo-Distributed Data Centers

The issues of running MapReduce joins in a geo-distributed computing context are discussed, and a hierarchical computing approach is proposed to boost the performance of the join algorithm.

Hierarchical MapReduce Programming Model and Scheduling Algorithms

  • Yuan Luo, Beth Plale
  • Computer Science
    2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)
  • 2012
A hierarchical MapReduce framework is presented that gathers computation resources from different clusters and runs MapReduce jobs across them, and two scheduling algorithms are introduced.

Mapreduce Algorithms Optimizes the Potential of Big Data

An overview of the architecture and components of Hadoop and HCFS (Hadoop Cluster File System) is provided, along with the MapReduce programming model and its components.

Pilot-MapReduce: an extensible and flexible MapReduce implementation for distributed data

Experimental evaluations show that the Pilot abstractions are powerful for distributed data: PMR can lower execution time on distributed clusters and provides the desired flexibility in the deployment and configuration of MapReduce runs to address specific application characteristics.

Understanding mapreduce-based next-generation sequencing alignment on distributed cyberinfrastructure

Pilot-MapReduce (PMR) provides an effective means by which a variety of new or existing methods for NGS and downstream analysis can be carried out whilst providing efficiency and scalability across multiple clusters.



MapReduce: simplified data processing on large clusters

This paper explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Map-reduce-merge: simplified relational data processing on large clusters

A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
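The Merge idea can be illustrated with a toy sort-merge join. This is a hypothetical sketch, not the authors' system: two MapReduce pipelines are assumed to have produced key-sorted outputs, and the Merge step joins them on key (unique keys per side assumed for brevity).

```python
# Hypothetical sketch of the Map-Reduce-Merge idea: two key-sorted
# MapReduce outputs are joined by a Merge step (sort-merge join).

def merge_join(left, right):
    """Sort-merge join of two key-sorted (key, value) lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

# e.g. employees joined with departments, both sorted by dept id
emps = [(1, "ann"), (2, "bob"), (3, "eve")]
depts = [(1, "hr"), (3, "eng")]
print(merge_join(emps, depts))  # [(1, 'ann', 'hr'), (3, 'eve', 'eng')]
```

Because both inputs are already partitioned and sorted by the upstream reduce, the merge needs only a single linear pass over each side.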

Job Scheduling for Multi-User MapReduce Clusters

Two simple techniques, delay scheduling and copy-compute splitting, are developed which improve throughput and response times in multi-user MapReduce workloads by factors of 2 to 10 and can also raise throughput in a single-user, FIFO workload by a factor of 2.
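Delay scheduling can be sketched in a few lines. This is an illustrative simplification, not the paper's scheduler: a job skips a small number of non-local slot offers in the hope of landing on a node that holds its data, then gives up and runs remotely.

```python
# Illustrative sketch of delay scheduling: skip up to max_skips
# non-data-local offers before accepting a remote slot.

def delay_schedule(job_data_nodes, free_node_offers, max_skips=3):
    """Return (node, is_local) for the slot the job accepts."""
    skips = 0
    for node in free_node_offers:
        if node in job_data_nodes:
            return node, True   # data-local: accept immediately
        if skips >= max_skips:
            return node, False  # waited long enough: run remote
        skips += 1
    return None, False          # no offer accepted

offers = ["n4", "n5", "n2", "n6"]
print(delay_schedule({"n1", "n2"}, offers))  # ('n2', True)
```

The trade-off is a small scheduling delay in exchange for a much higher fraction of data-local tasks, which is where the reported throughput gains come from.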

A framework for adaptive execution in grids

A new Globus-based framework is presented that allows easier and more efficient execution of jobs in a 'submit and forget' fashion; it is currently functional on any Grid testbed based on Globus because it does not require new system software to be installed on the resources.

CloudBATCH: A Batch Job Queuing System on Clouds with Hadoop and HBase

  • Chen Zhang, H. De Sterck
  • Computer Science
    2010 IEEE Second International Conference on Cloud Computing Technology and Science
  • 2010
CloudBATCH is presented, a prototype solution that enables Hadoop to function as a traditional batch job queuing system with enhanced functionality for cluster resource management.

Customized Plug-in Modules in Metascheduler Community Scheduler Framework 4 (CSF4) for Life Sciences Applications

The metascheduler CSF4 is extended with a Virtual Job Model (VJM) to synchronize resource co-allocation for cross-domain parallel jobs, which eliminates deadlocks and improves resource usage for multi-cluster parallel applications compiled with MPICH-G2.

BDT: an easy-to-use front-end application for automation of massive docking tasks and complex docking strategies with AutoDock

BDT is an easy-to-use graphical interface for AutoGrid/AutoDock that lowers a barrier for research teams in the fields of Chemistry and the Life Sciences who are interested in conducting blind-docking experiments against the whole receptor surface.

Condor-G: A Computation Management Agent for Multi-Institutional Grids

It is asserted that Condor-G can serve as a general-purpose interface to Grid resources, for use by both end users and higher-level program development tools.

Nimrod/G: an architecture for a resource management and scheduling system in a global computational grid

  • R. Buyya, D. Abramson, J. Giddy
  • Computer Science
    Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region
  • 2000
The proposed Nimrod/G grid-enabled resource management and scheduling system builds on the earlier work on Nimrod and follows a modular and component-based architecture enabling extensibility, portability, ease of development, and interoperability of independently developed components.

Sun Grid Engine: towards creating a compute power grid

  • W. Gentzsch
  • Computer Science
    Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid
  • 2001
This paper provides an up-to-date overview of the Sun Grid Engine distributed resource management software and of future plans to enhance it toward, and integrate it into, a computational grid.