A comparison of approaches to large-scale data analysis

@inproceedings{Pavlo2009ACO,
  title={A comparison of approaches to large-scale data analysis},
  author={Andrew Pavlo and Erik Paulson and Alexander Rasin and Daniel J. Abadi and David J. DeWitt and Samuel Madden and Michael Stonebraker},
  booktitle={Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data},
  year={2009}
}
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark… 

Big Data Processing Systems
TLDR
This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities.
The family of mapreduce and large-scale data processing systems
TLDR
This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
TLDR
This paper explores the feasibility of building a hybrid system that takes the best features from both technologies; the prototype built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.
COMPARISON OF TABLE JOIN EXECUTION TIME FOR PARALLEL DBMS AND MAPREDUCE
TLDR
Detailed process models for table joins in a parallel row-storage DBMS and an MR system, together with the results of detailed calculation experiments performed on these models, show that as the stored data volume increases, the parallel DBMS starts losing to the MR system at certain thresholds.
Performance Characterization of Modern Databases on Out-of-Order CPUs
TLDR
It is observed that performance of modern databases is severely limited by poor cache/memory performance, and it is demonstrated that dynamic execution techniques are still effective in hiding a significant fraction of the stalls, thereby improving performance.
Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology
TLDR
This paper provides an overview of HadoopDB’s original design, and its evolution during the subsequent ten years of research and development effort, and describes how the project innovated both in the research lab, and as a commercial product at Hadapt and Teradata.
SINGLE vs. MapReduce vs. Relational: Predicting Query Execution Time
TLDR
This paper introduces and tests a storage alternative that goes against current data normalization premises, where storage space is no longer a concern, and proposes a new system concept, called SINGLE, in which query execution time must be entirely predictable, independent of query complexity.
The Family of Map-Reduce
TLDR
This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data analysis that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities.
Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
TLDR
This talk describes some experiences in using parallel databases and MapReduce systems, and a hybrid system that the author is building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms.

References

SHOWING 1-10 OF 24 REFERENCES
SCOPE: easy and efficient parallel processing of massive data sets
TLDR
A new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis, designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.
Pig latin: a not-so-foreign language for data processing
TLDR
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
GAMMA - A High Performance Dataflow Database Machine
TLDR
The Gamma prototype shows how parallelism can be controlled with minimal control overhead through a combination of the use of algorithms based on hashing and the pipelining of data between processes.
LINQ: reconciling object, relations and XML in the .NET framework
TLDR
The .NET Language-Integrated Query (LINQ) framework, proposed for the next release of the .NET framework, approaches the problem of handling data from different data models by defining a pattern of general-purpose standard query operators for traversal, filter, and projection.
Dryad: distributed data-parallel programs from sequential building blocks
TLDR
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
MAD Skills: New Analysis Practices for Big Data
TLDR
This paper highlights the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, and describes database design methodologies that support the agile working style of analysts in these settings.
An Overview of The System Software of A Parallel Relational Database Machine GRACE
TLDR
The novel virtual space management algorithm is proposed, which enables the system software to handle a large data stream…
Implementation of data abstraction in the relational database system INGRES
TLDR
The design and implementation of an abstract data type (ADT) facility which was added to the INGRES database manager and possible extensions to this new facility are described.
Multiprocessor Hash-Based Join Algorithms
TLDR
It is demonstrated that bit vector filtering provides dramatic improvement in the performance of all algorithms, including the sort-merge join algorithm, and is shown to provide linear increases in throughput with corresponding increases in processor and disk resources.
The Case for Shared Nothing
TLDR
This paper argues that shared nothing is the preferred approach to building high transaction rate multiprocessor systems.