SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures

@article{Floratou2014SQLonHadoopFC,
  title={SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures},
  author={Avrilia Floratou and Umar Farooq Minhas and Fatma {\"O}zcan},
  journal={Proc. VLDB Endow.},
  year={2014},
  volume={7},
  pages={1295--1306}
}
SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar… 

Performance analysis of shared-nothing SQL-on-Hadoop frameworks based on columnar database systems

TLDR
This study compares the performance of SQL-on-Hadoop systems under various hardware and software parameters and shows that Impala outperforms Hive and Tajo when the workload dataset fits in its memory.

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

TLDR
A benchmark based on a denormalized version of TPC-H is used to compare the performance of Hive on Tez, Spark, Presto and Drill; the resulting findings on architecture and infrastructure in SQL-on-Hadoop for Big Data Warehousing are intended to help practitioners and foster future research.

A comparative analysis of state-of-the-art SQL-on-Hadoop systems for interactive analytics

TLDR
This work performs a comparative analysis of four state-of-the-art SQL-on-Hadoop systems (Impala, Drill, Spark SQL and Phoenix) using the Web Data Analytics micro benchmark and the TPC-H benchmark on the Amazon EC2 cloud platform to provide insights into performance variations, performance bottlenecks and query execution characteristics.

SQL Query Performance on Hadoop: An Analysis Focused on Large Databases of Brazilian Electronic Invoices

TLDR
This work analyzes the performance of SQL queries on Hadoop using the Impala engine, compares it with an RDBMS-based approach, and shows speedups from 2.7x to 14x with Impala/Hadoop for the queries considered, on a lower-cost hardware/software platform.

Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs

TLDR
This paper evaluated the TPC-DS benchmark on a combination of query engines (Spark and Tez) and JVMs (J9 and OpenJDK) and proposed classification models for selecting the best combination of systems with a generated query plan.

Vortex : taking SQL-on-Hadoop to the next level

TLDR
This work describes the main technical extensions to single-server Vectorwise that turned it into a Hadoop-based MPP system, in terms of workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration.

A High-Performance Distributed Relational Database System for Scalable OLAP Processing

TLDR
This work presents HRDBMS, a fully implemented distributed shared-nothing relational database designed to improve the scalability of OLAP queries; it achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques.

Big SQL 3.0 Functionality: A comprehensive approach to SQL-on-Hadoop

TLDR
Big SQL 3.0 achieves several important objectives, including comprehensive support of ANSI SQL 2011, application integration and portability, query federation, and enterprise capabilities such as performance, security, and monitoring.

VectorH: Taking SQL-on-Hadoop to the Next Level

TLDR
This work describes the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration, and evaluates VectorH against HAWQ, Impala, SparkSQL and Hive.

TPCx-BB (Big Bench) in a Single-Node Environment

TLDR
This paper presents response times of all 30 BigBench queries when run sequentially to showcase the advanced analytics and machine learning capabilities integrated within SQL Server 2019, and presents results from data scalability experiments over two scale factors to understand the impact of increasing data size on query runtimes.

References

Shark: SQL and rich analytics at scale

TLDR
Shark is a new data analysis system that marries query processing with complex analytics on large clusters and extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL.

Hive - a petabyte scale data warehouse using Hadoop

TLDR
Hive is presented, an open-source data warehousing solution built on top of Hadoop that supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop.

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

TLDR
This paper explores the feasibility of building a hybrid system that takes the best features from both technologies; the prototype built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.

HAWQ: a massively parallel processing SQL engine in hadoop

TLDR
The novel design of HAWQ is presented, including query processing, the scalable software interconnect based on UDP protocol, transaction management, fault tolerance, read optimized storage, the extensible framework for supporting various popular Hadoop based data stores and formats, and various optimization choices the authors considered to enhance the query performance.

YSmart: Yet Another SQL-to-MapReduce Translator

TLDR
YSmart, a correlation-aware SQL-to-MapReduce translator that applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query, can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators.

RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

TLDR
This paper presents a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system and shows the effectiveness of RCFile in satisfying the four requirements.

Can the Elephants Handle the NoSQL Onslaught?

TLDR
This paper compares one representative NoSQL system from each end of this spectrum with SQL Server, and analyzes the performance and scalability aspects of each of these approaches on two workloads that represent the two ends of the application spectrum.

Data page layouts for relational databases on deep memory hierarchies

TLDR
This paper proposes a new data organization model called PAX (Partition Attributes Across) that significantly improves cache performance by grouping together all values of each attribute within each page, and shows that PAX performs well across different memory system designs.

C-Store: A Column-oriented DBMS

TLDR
Preliminary performance data on a subset of TPC-H is presented and it is shown that the system the team is building, C-Store, is substantially faster than popular commercial products.

Column-oriented Database Systems

TLDR
This tutorial presents an overview of column-oriented database system technology and addresses questions about how easily a major row-based system can achieve column-store performance and which new applications column-stores can potentially enable.