HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, Abraham Silberschatz
Proc. VLDB Endow.
The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands… 
Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
This talk describes some experiences in using parallel databases and MapReduce systems, and a hybrid system that the author is building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms.
Intentional Data Placement Optimization for Distributed Data Warehouses
This paper proposes a MapReduce data block allocation approach to improve MapReduce job execution and query performance on multi-node clusters, especially Hadoop clusters, and suggests that defining a good data placement on a cluster during the implementation of a data warehouse significantly improves OLAP cube construction and querying performance.
A Comparison of MapReduce and Parallel Database Management Systems
The aim of this paper is to provide a high-level comparison between MapReduce and parallel DBMSs, providing a selection of criteria that can be used to choose between MapReduce and a parallel DBMS for a particular enterprise application.
Scalability and Performance for Data Management in the Cloud
This paper reviews existing systems capable of handling large scale data, starting with Gamma, one of the first and most influential parallel databases, up to modern data management frameworks such as MapReduce and HadoopDB.
LinearDB: A Relational Approach to Make Data Warehouse Scale Like MapReduce
This paper designs a prototype system called LinearDB, which organizes data in a decomposed snowflake schema and adopts three operations - transform, reduce and merge - to accomplish query processing and shows that its scalability matches MapReduce and its performance is up to 3 times as good as that of PostgreSQL.
PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system
This paper proposes a new notion of the DFS-integrated DBMS where a DBMS is tightly integrated with the distributed file system (DFS), and shows that PARADISE effectively overcomes the drawbacks of HadoopDB by identifying the following strengths.
Towards Conceptual MapReduce Algorithm for Big Data Platform
  • Seung-Beom Sohn, Jin-Hong Kim
  • Computer Science
    2015 International Conference on Computational Intelligence and Communication Networks (CICN)
  • 2015
The notion of MapReduce design patterns is presented; these patterns instantiate arrangements of components and specific techniques designed to handle frequently encountered situations across a variety of domains.
The performance of MapReduce
By carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5, and is thus more comparable to that of parallel database systems.
Big Data Processing Systems
This chapter provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities.
Subject: Distributed Database Systems Q1) Map Reduce and Distributed Databases
A marriage of MapReduce with DBMS Technologies has been touted as the new approach towards achieving both performance on the one hand as well as scalability and flexibility on the other hand.


A comparison of approaches to large-scale data analysis
A benchmark consisting of a collection of tasks that are run on an open source version of MR as well as on two parallel DBMSs shows a dramatic performance difference between the two paradigms.
SCOPE: easy and efficient parallel processing of massive data sets
A new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis, designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.
Pig Latin: a not-so-foreign language for data processing
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce; its implementation, Pig, is an open-source Apache-incubator project available for general use.
GAMMA - A High Performance Dataflow Database Machine
The Gamma prototype shows how parallelism can be controlled with minimal control overhead through a combination of the use of algorithms based on hashing and the pipelining of data between processes.
An Overview of The System Software of A Parallel Relational Database Machine GRACE
A novel virtual space management algorithm is proposed, which enables the system software to handle a large data stream …
Cooperative Expendable Micro-Slice Servers (CEMS): Low Cost, Low Power Servers for Internet-Scale Services
  • 2008
C-Store: A Column-oriented DBMS
Preliminary performance data on a subset of TPC-H is presented and it is shown that the system the team is building, C-Store, is substantially faster than popular commercial products.
Xen and the art of virtualization
This paper presents Xen, an x86 virtual machine monitor that allows multiple commodity operating systems to share conventional hardware in a safe and resource-managed fashion without sacrificing performance or functionality.
Worldwide RDBMS 2005 vendor shares
  • IDC
  • 2006
Sorting 1PB with MapReduce
  • 2008