BSP cost and scalability analysis for MapReduce operations

Hermes Senger, Veronica Gil-Costa, Luciana Arantes, Cesar Augusto Cavalheiro Marcondes, Mauricio Marín, Liria Matsumoto Sato, Fabrício A. B. Silva. Concurrency and Computation: Practice and Experience, pages 2503–2527.
Data abundance creates the need for powerful and easy-to-use tools that support processing large amounts of data. MapReduce has been widely adopted for over a decade by many companies and, more recently, has attracted the attention of a growing number of researchers in several areas. One main advantage is that the complex details of parallel processing, such as network programming, task scheduling, data placement, and fault tolerance, are hidden in a conceptually simple…
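To make the abstraction concrete, the sketch below shows a minimal, single-process rendition of the MapReduce programming model: the user supplies only a map and a reduce function, while grouping by key (the shuffle) is handled by the framework. The function names here are illustrative, not from the paper.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (word, 1) for every word in the input line.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Sum the counts emitted for one word.
    return (key, sum(values))

def mapreduce(records, mapper, reducer):
    groups = defaultdict(list)  # the "shuffle": group emitted values by key
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

counts = mapreduce(["to be or not to be"], map_phase, reduce_phase)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In a real MapReduce system the shuffle moves data across the network between mapper and reducer machines; the dictionary above merely stands in for that step.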

MapReduce and Its Applications, Challenges, and Architecture: a Comprehensive Review and Directions for Future Research

This paper provides a discussion of the differences between varied implementations of MapReduce as well as some guidelines for planning future research.

Parallel and distributed computing for Big Data applications

This special issue contains eight papers presenting recent advances on parallel and distributed computing for Big Data applications, focusing on their scalability and performance.

SparkBLAST: scalable BLAST processing using in-memory operations

SparkBLAST, a parallelization of a sequence alignment application that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework, outperforms an equivalent system implemented on Hadoop in terms of speedup and execution times.

Evaluating The Scalability of Big Data Frameworks

This paper discusses how scalability in big data frameworks is governed by a scalability constant (β), confirming that isoefficiency grew linearly as the size of the data sets was increased.
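The β constant is specific to that paper, but the quantities isoefficiency analysis builds on are the standard textbook metrics of speedup and parallel efficiency, sketched below for reference.

```python
def speedup(t_serial, t_parallel):
    # Speedup S = T_serial / T_parallel.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    # Parallel efficiency E = S / p, for p workers; isoefficiency analysis
    # asks how fast the problem size must grow to keep E constant as p grows.
    return speedup(t_serial, t_parallel) / p

# Example: a job taking 100 s serially and 16 s on 8 workers.
s = speedup(100.0, 16.0)        # 6.25
e = efficiency(100.0, 16.0, 8)  # 0.78125
```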

Multi-BSP vs. BSP: A Case of Study for Dell AMD Multicores

This work models two different multi-core Dell architectures and shows that a simple model with few parameters can be easily adapted to each Dell platform, in contrast to complex models that tend to rely on tricky hardware parameters.

Extracting sample data based on poisson distribution

A novel Poisson-based sampling method is introduced to provide a comprehensive data set for real-time analysis, and the experimental results show the efficiency of the proposed method.
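One common form of Poisson sampling includes each record in the sample independently, each with its own inclusion probability; the paper's method may differ in detail, so the following is only a generic sketch with a single uniform probability.

```python
import random

def poisson_sample(records, inclusion_prob, seed=None):
    # Each record is an independent Bernoulli trial: included with
    # probability inclusion_prob, so the sample size itself is random.
    rng = random.Random(seed)
    return [r for r in records if rng.random() < inclusion_prob]

sample = poisson_sample(range(10_000), inclusion_prob=0.01, seed=42)
# Expected sample size is 10_000 * 0.01 = 100 records, varying run to run.
```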

Functional abstraction for programming multi-level architectures: formalisation and implementation

The Multi-ML language is introduced, which allows programming Multi-BSP algorithms "à la ML" and thus guarantees the properties of the Multi-BSP model and execution safety, thanks to an ML type system.

Improving parallel performance of ensemble learners for streaming data through data locality with mini-batching

G. Cassales, Heitor Murilo Gomes, A. Bifet, B. Pfahringer, H. Senger. 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2020.
A mini-batching strategy is proposed that can significantly reduce cache misses and improve the performance of several ensemble algorithms for stream mining in multi-core environments, showing speedups of up to 5X on 8-core processors with ensembles of 100 and 150 learners.
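The mini-batching idea can be sketched as follows: instead of handing each stream instance to every ensemble member one at a time (forcing each learner's state back into cache per instance), buffer a small batch and let each learner process the whole batch while its state is cache-resident. The learner class and driver below are hypothetical stand-ins, not the paper's implementation.

```python
def process_stream(stream, learners, batch_size=64):
    # Buffer instances and train each learner on whole batches, so each
    # learner's state is touched once per batch instead of once per instance.
    batch = []
    for instance in stream:
        batch.append(instance)
        if len(batch) == batch_size:
            for learner in learners:   # one pass per learner, not per instance
                learner.partial_fit(batch)
            batch.clear()
    if batch:                          # flush the final partial batch
        for learner in learners:
            learner.partial_fit(batch)

class CountingLearner:
    """Stand-in ensemble member that just counts the instances it has seen."""
    def __init__(self):
        self.seen = 0
    def partial_fit(self, batch):
        self.seen += len(batch)

learners = [CountingLearner() for _ in range(3)]
process_stream(range(150), learners, batch_size=64)
# Every learner has now seen all 150 instances (two full batches + a flush).
```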

Big Data Analytics in Weather Forecasting: A Systematic Review

This paper presents a systematic literature review of big data analytic approaches in weather forecasting (published between 2014 and August 2020) and compares the surveyed categories with respect to accuracy, scalability, execution time, and other Quality of Service factors.

The family of mapreduce and large-scale data processing systems

This article provides a comprehensive survey of a family of large-scale data processing approaches and mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities.

A survey of large-scale analytical query processing in MapReduce

A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problems they target, and interesting directions for future parallel data processing systems are outlined.

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

This paper addresses the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model; the proposed approach consists of a monitoring component, executed on every mapper, that captures the local data distribution and identifies its most relevant subset for cost estimation.
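Once per-key costs have been estimated, one simple way to balance reducers is a greedy assignment: give each key (heaviest first) to the currently least-loaded reducer. This is only an illustrative baseline, not the paper's actual algorithm, and the example costs are made up.

```python
import heapq

def assign_keys(key_costs, num_reducers):
    # Min-heap of (current load, reducer id); heaviest keys placed first.
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for key, cost in sorted(key_costs.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment

# Hypothetical estimated costs per key, balanced across 2 reducers:
plan = assign_keys({"a": 50, "b": 30, "c": 20, "d": 10}, num_reducers=2)
```

Contrast this with hash partitioning, which ignores costs entirely and can send several heavy keys to the same reducer.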

MRShare: Sharing Across Multiple Queries in MapReduce

A sharing framework tailored to MapReduce is proposed that transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query.

Online aggregation and continuous query support in MapReduce

A modified version of the Hadoop MapReduce framework is presented that supports online aggregation, allowing users to see "early returns" from a job while it is being computed; it can also reduce completion times and improve system utilization for batch jobs.
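The core idea of online aggregation can be sketched without any Hadoop machinery: stream through the input and periodically yield a running estimate, so a user sees early answers before the full scan finishes. The function below is a generic illustration using a running mean.

```python
def online_mean(values, report_every=1000):
    # Yield (rows seen, running mean) every `report_every` rows, then the
    # final exact answer once the input is exhausted.
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
        if count % report_every == 0:
            yield count, total / count   # early estimate after `count` rows
    if count % report_every != 0:
        yield count, total / count       # final, exact answer

estimates = list(online_mean(range(10), report_every=4))
# estimates == [(4, 1.5), (8, 3.5), (10, 4.5)]
```

A real system must also quantify how reliable each early estimate is (e.g. via confidence intervals), which this sketch omits.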

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

This paper explores the feasibility of building a hybrid system that takes the best features from both technologies; the prototype built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

A model of computation for MapReduce

A simulation lemma is proved showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce, and it is demonstrated how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds.

Profiling, what-if analysis, and cost-based optimization of MapReduce programs

This work introduces, to the authors' knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs; it focuses on the optimization opportunities presented by the large space of configuration parameters for these programs.

ReStore: reusing results of MapReduce jobs in pig

ReStore, an extension to Pig, is demonstrated: it manages the storage and reuse of intermediate results of MapReduce workflows executed in the Pig data analysis system and rewrites input queries to reuse stored intermediate results.

Early Accurate Results for Advanced Analytics on MapReduce

This paper proposes and implements a non-parametric extension of Hadoop that allows the incremental computation of early results for arbitrary workflows, along with reliable online estimates of the degree of accuracy achieved so far in the computation.