• Corpus ID: 19934321

A Survey on Data-Centric and Data-Aware Techniques for Large Scale Infrastructures

@article{CanoLores2016ASO,
  title={A Survey on Data-Centric and Data-Aware Techniques for Large Scale Infrastructures},
  author={Silvina Ca{\'i}no-Lores and Jes{\'u}s Carretero},
  journal={World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering},
  year={2016},
  volume={10},
  pages={517--523}
}
  • Silvina Caíno-Lores, J. Carretero
  • Published 1 February 2016
  • Computer Science
  • World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering
Large scale computing infrastructures have been widely developed with the core objective of providing a suitable platform for high-performance and high-throughput computing. These systems are designed to support resource-intensive and complex applications, which can be found in many scientific and industrial areas. Currently, large scale data-intensive applications are hindered by the high latencies that result from the access to vastly distributed data. Recent works have suggested that… 

On the effects of allocation strategies for exascale computing systems with distributed storage and unified interconnects

This paper investigates alternatives for the storage subsystem of a novel exascale‐capable system with special emphasis on how allocation strategies would affect the overall performance, and suggests that scheduling policies exposing data‐locality information can be essential for the appropriate utilization of future large‐scale systems.

On the Effects of Data-Aware Allocation on Fully Distributed Storage Systems for Exascale

This paper shows the need to enhance system schedulers so that they differentiate between compute- and data-oriented applications, minimising interference between storage and application traffic.

Data Locality in High Performance Computing, Big Data, and Converged Systems: An Analysis of the Cutting Edge and a Future System Architecture

An extensive review of cutting-edge research on data locality in HPC, big data, and converged environments is provided and a system architecture for future HPC and big data converged systems is proposed.

JHTD: An Efficient Joint Scheduling Framework Based on Hypergraph for Task Placement and Data Transfer Across Geographically Distributed Data Centers

This work proposes an efficient hypergraph-based joint scheduling framework, JHTD, for task placement and data transfer across geographically distributed data centers, and its results demonstrate that JHTD can reduce the makespan by up to 20.6%.

Hypergraph+: An Improved Hypergraph-Based Task-Scheduling Algorithm for Massive Spatial Data Processing on Master-Slave Platforms

An extended hypergraph-based task-scheduling algorithm, named Hypergraph+, is proposed for massive spatial data processing and improves upon current hypergraph scheduling algorithms in two ways: it takes platform heterogeneity into consideration, offering a metric function to evaluate the partitioning quality in order to derive the best task/file schedule.

References

New Worker-Centric Scheduling Strategies for Data-Intensive Grid Applications

This paper proposes a series of worker-centric scheduling strategies for data-intensive applications and evaluates how each strategy performs compared to a task-centric one, showing that worker-centric strategies improve performance in terms of makespan and bandwidth usage.

VIDAS: object-based virtualized data sharing for high performance storage I/O

With scientific computing in the cloud gaining popularity and using ever larger data sets, high performance storage I/O in virtualized environments is substantially increasing in importance.

Exploiting Replication and Data Reuse to Efficiently Schedule Data-Intensive Applications on Grids

Storage Affinity exploits a data reuse pattern, common in many data-intensive applications, that allows it to take data transfer delays into account and reduce the makespan of the application, and uses a replication strategy that yields efficient schedules without relying upon dynamic information that is difficult to obtain.

Parallel Programming Paradigms and Frameworks in Big Data Era

  • C. Dobre, F. Xhafa
  • Computer Science
    International Journal of Parallel Programming
  • 2013
This paper discusses and analyzes opportunities and challenges for efficient parallel data processing, reviews various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and presents modern emerging paradigms and frameworks.

HaLoop: Efficient Iterative Data Processing on Large Clusters

HaLoop is presented, a modified version of the Hadoop MapReduce framework that is designed to serve iterative applications and dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.

A Data Locality Aware Online Scheduling Approach for I/O-Intensive Jobs with File Sharing

A hypergraph based dynamic scheduling heuristic for a stream of independent I/O intensive jobs with file sharing behavior based on an event-driven, run-time hypergraph modeling of the file sharing characteristics among jobs is proposed.

Nephele: efficient parallel data processing in the cloud

Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's compute clouds for both task scheduling and execution, and is presented as an ongoing research project.

An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing

This work proposes a heuristic task scheduling algorithm in which an initial task allocation is produced first, and the job completion time is then reduced gradually by tuning that initial assignment.

A new paradigm: Data-aware scheduling in grid computing

BAR: An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing

A heuristic task scheduling algorithm called Balance-Reduce (BAR) is proposed, in which an initial task allocation is produced first and the job completion time is then reduced gradually by tuning that initial allocation, taking a global view.