AIDA - Abstraction for Advanced In-Database Analytics

@article{Dsilva2018AIDAA,
  title={AIDA - Abstraction for Advanced In-Database Analytics},
  author={Joseph Vinish D'silva and Florestan De Moor and Bettina Kemme},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={1400--1413}
}
With the tremendous growth in data science and machine learning, it has become increasingly clear that traditional relational database management systems (RDBMS) are lacking appropriate support for the programming paradigms required by such applications, whose developers prefer tools that perform the computation outside the database system. While the database community has attempted to integrate some of these tools in the RDBMS, this has not swayed the trend as existing solutions are often not… 

Making an RDBMS Data Scientist Friendly: Advanced In-database Interactive Analytics with Visualization Support
We are currently witnessing the rapid evolution and adoption of various data science frameworks that function external to the database. Any support from conventional RDBMS implementations for data
Keep Your Host Language Object and Also Query it: A Case for SQL Query Support in RDBMS for Host Language Objects
TLDR
This paper proposes and implements the concept of virtual tables that can be used to expose data set objects maintained by the embedded HLL interpreter to the query engine for executing relational operations, and facilitates better optimization opportunities for the execution of SQL queries.
Accelerating Database Queries for Advanced Data Analytics: A New Approach
TLDR
This paper proposes an advanced analytical system, HorsePower, based on HorseIR, an array-based intermediate representation (IR) designed to translate conventional database queries, statistical languages, and mixes of the two into a common IR, so that query optimization and compiler optimization techniques can be combined at an intermediate level of abstraction.
HorsePower: Accelerating Database Queries for Advanced Data Analytics
TLDR
This paper proposes an advanced analytical system, HorsePower, based on HorseIR, an array-based intermediate representation (IR) designed to translate conventional database queries, statistical languages, and mixes of the two into a common IR, so that query optimization and compiler optimization techniques can be combined at an intermediate level of abstraction.
An Unified System for Data Analytics and In Situ Query Processing
TLDR
Python’s Dask framework is extended to present DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources, along with a novel distributed learned index to improve join performance.
DaskDB: Scalable Data Science with Unified Data Analytics and In Situ Query Processing
TLDR
This work presents DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources, and introduces a distributed index join algorithm and a novel distributed learned index to improve join performance.
COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics
TLDR
This paper extends the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries and implements these extensions inside Microsoft SQL Server, a commercial DBMS engine.
The Collection Virtual Machine: an abstraction for multi-frontend multi-backend data analysis
TLDR
This paper proposes the "Collection Virtual Machine" (or CVM), an extensible compiler framework designed to keep the specialization process of data analytics systems tractable and to improve the interoperability of both analyses and hardware platforms.
Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System
TLDR
Modin translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that are formalized in this paper and introduces metadata independence to allow metadata to be decoupled from the physical representation and maintained lazily.
Scalable unified data analytics
TLDR
This thesis evaluates data analytics systems that support the data science workflow by introducing a data science benchmark, Sanzu, and argues that data analysts and scientists would want to use a single system that can perform both data analysis tasks and SQL querying, without requiring data movement between different systems.

References

SciQL: array data processing inside an RDBMS
TLDR
This demo presents a proof-of-concept implementation of SciQL in the relational database system MonetDB, and demonstrates the storage of arrays in MonetDB as first-class citizens and the execution of a comprehensive set of basic operations on arrays.
Bridging Two Worlds with RICE: Integrating R into the SAP In-Memory Computing Engine
TLDR
This work proposes an alternative data exchange mechanism with R, SQL-SHM, a shared memory-based data exchange that incorporates R’s vertical data structure, and extends this approach to R-Op, which introduces R scripts equivalent to native database operations like join or aggregation within the execution plans.
Towards a unified architecture for in-RDBMS analytics
TLDR
This work proposes a unified architecture for in-database analytics that requires changes to only a few dozen lines of code to integrate a new statistical technique, and demonstrates the feasibility of this architecture by integrating several popular analytics techniques into two commercial and one open-source RDBMS.
LevelHeaded: A Unified Engine for Business Intelligence and Linear Algebra Querying
TLDR
This work presents a new in-memory query processing engine called LevelHeaded, which uses worst-case optimal joins as its core execution mechanism for both BI and LA queries and outperforms other relational database engines by orders of magnitude on standard LA benchmarks.
Vectorized UDFs in Column-Stores
TLDR
MonetDB/Python is presented, a new system that combines the open-source database MonetDB with the vector-based language Python and demonstrates efficiency gains of orders of magnitude.
Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded
TLDR
This demonstration showcases the EmptyHeaded engine: an interactive query processing engine that leverages a novel query architecture to support efficient execution in multiple domains, and highlights the strengths and weaknesses of this novel type of query processing architecture while showcasing its flexibility across those domains.
Efficient data management and statistics with zero-copy integration
TLDR
This paper argues that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types and presents a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing.
Compiling mappings to bridge applications and databases
TLDR
This work presents a novel approach to this problem, in which the relationship between the application data and the persistent storage is specified using a declarative mapping, which is compiled into bidirectional views that drive the data transformation engine.
Design and Implementation of an Extensible Database Management System Supporting User Defined Data Types and Functions
TLDR
This paper describes an extension mechanism for data types and functions, implemented at the IBM Scientific Center in Heidelberg and based upon HDBL, an SQL-based query language for complex objects.
Scalable Linear Algebra on a Relational Database System
TLDR
The results should at least raise the possibility that brand new systems designed from the ground up to support scalable linear algebra are not absolutely necessary, and that such systems could instead be built on top of existing relational technology.