MonetDB/DataCell: Online Analytics in a Streaming Column-Store

Erietta Liarou, Stratos Idreos, Stefan Manegold, Martin L. Kersten
Proc. VLDB Endow.
In DataCell, we design streaming functionalities in a modern relational database kernel which targets big data analytics. This includes exploitation of both its storage/execution engine and its optimizer infrastructure. We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advantages for modern applications in need of online analytics, such as web logs, network monitoring and scientific data management. The major challenge then… 

Enhanced stream processing in a DBMS kernel
This paper focuses on incremental window-based processing, arguably the most crucial stream-specific requirement, and designs a stream engine on top of an existing relational database kernel, in order to maintain and reuse the generic storage and execution model of the DBMS.
Database support for processing complex aggregate queries over data streams
The goal of this thesis is to investigate the potential of combining database systems with SPEs in the context of stream processing so as to improve the overall query evaluation performance.
SnappyData: Streaming, Transactions, and Interactive Analytics in a Unified Engine
SnappyData is the first to offer end users an intuitive means for expressing their accuracy requirements without overwhelming them with statistical concepts, through a novel concept of high-level accuracy contracts (HAC).
SnappyData: A Unified Cluster for Streaming, Transactions and Interactive Analytics
SnappyData is presented as the first unified engine capable of delivering analytics, transactions, and stream processing in a single integrated cluster by carefully marrying a big data computational engine with a scale-out transactional store.
ENTRADA: A high-performance network traffic data streaming warehouse
We present ENTRADA, a high-performance data streaming warehouse that enables researchers and operators to analyze vast amounts of network traffic and measurement data within interactive response
SnappyData: A Hybrid Transactional Analytical Store Built On Spark
This work proposes a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics).
DBStream: An online aggregation, filtering and processing system for network traffic monitoring
DBStream is introduced, a novel online traffic monitoring system based on the DSW paradigm, which allows fast and flexible analysis across multiple heterogeneous data sources, and provides a novel stream processing language for implementing data processing modules, as well as aggregation, filtering, and storage capabilities for further data analysis.
Large-scale network traffic monitoring with DBStream, a system for rolling big data analysis
DBStream is described, an SQL-based system that explicitly supports incremental queries for rolling data analysis, and a performance comparison of DBStream with a parallel data processing engine (Spark) is presented, showing that, in some scenarios, a single DBStream node can outperform a cluster of ten Spark nodes on rolling network monitoring workloads.
DBStream: A holistic approach to large-scale network traffic monitoring and analysis
A thin monitoring layer for top-k aggregation queries over a database
The proposed family of maintenance algorithms further exploits the relations between the monitored rankings known from multi-query optimisation, and presents results of a preliminary experimental evaluation using TPC-H data to study the performance of the algorithms.


Experience in Extending Query Engine for Continuous Analytics
The result is a new kind of tightly integrated, highly efficient system with advanced stream processing capability as well as full DBMS functionality, which can significantly reduce the engineering investment needed for developing the streaming technology.
Continuous Analytics: Rethinking Query Processing in a Network-Effect World
This paper describes the Continuous Analytics approach and outlines some of the key technical arguments behind it, creating a powerful and flexible system that can run SQL over tables, streams, and combinations of the two.
Exploiting the power of relational databases for efficient stream processing
A complete architecture, the DataCell, is proposed and implemented on top of an open-source column-oriented DBMS; it allows batch processing of tuples and selective picking of tuples from a basket based on the query requirements, exploiting a novel query component, the basket expressions.
TelegraphCQ: continuous dataflow processing
The current version of TelegraphCQ is shown, implemented by leveraging the code base of the open source PostgreSQL database system, a significant portion of which was found to be easily reusable.
Operator scheduling in data stream systems
The aim is to design a scheduling strategy that minimizes the maximum runtime system memory while maintaining the output latency within prespecified bounds; the paper presents Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing runtime memory usage.
Algorithms and metrics for processing multiple heterogeneous continuous queries
This article examines the problem of how to schedule multiple Continuous Queries in a DSMS to optimize different Quality of Service (QoS) metrics, and proposes a hybrid scheduling policy that strikes a fine balance between performance and fairness.
The Case for a Signal-Oriented Data Stream Management System
This paper motivates the need for a data management and continuous query processing architecture that integrates two different desired classes of functions into a single, unified software system.
IBM infosphere streams for scalable, real-time, intelligent transportation services
A prototype system that generates dynamic, multi-faceted views of transportation information for the city of Stockholm, using real vehicle GPS and road-network data is described and the use of IBM InfoSphere Streams, a scalable stream processing platform, is demonstrated.
Self-organizing tuple reconstruction in column-stores
A novel design, partial sideways cracking, is proposed that achieves performance similar to using presorted data, but without requiring the heavy initial presorting step itself, and brings significant performance benefits for multi-attribute queries.
NiagaraCQ: a scalable continuous query system for Internet databases
The design of NiagaraCQ is presented, some experimental results on the system's performance and scalability are given and other techniques including incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and memory caching are employed.