DiNoDB: An Interactive-Speed Query Engine for Ad-Hoc Queries on Temporary Data

Yongchao Tian, Ioannis Alagiannis, Erietta Liarou, Anastasia Ailamaki, Pietro Michiardi, Marko Vukolic
IEEE Transactions on Big Data
As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to re-iterate costly passes through the entire dataset (e… 
In-situ visual exploration over big raw data
RawVis: Visual Exploration over Raw Data
This work introduces a framework, named RawVis, built on top of a lightweight in-memory tile-based index, VALINOR, that is constructed on-the-fly given the first user query over a raw file and adapted based on the user interaction.
Adaptive Indexing for In-situ Visual Exploration and Analytics
This work presents an adaptive indexing scheme that enables efficient visual exploration and analytics over big raw data files, and enables categorical-based analytics using group-by and filter operations.
RawVis: A System for Efficient In-situ Visual Analytics
RawVis provides real-time interaction, reporting low response time, over large data files, using commodity hardware, and implements novel indexing schemes and adaptive processing techniques allowing users to perform efficient visual and analytics operations directly over the data files.
Towards Dynamic Verifiable Pattern Matching
The proposed scheme is built on two ideas: one is to embed unique randomness to decouple the character and its index in the outsourced data, enabling efficient data updates; the other is to reduce the verifiable pattern matching problem to a discrete set membership testing problem, which relies on the decoupling introduced in the first idea.
Efficiency and Agility for a Modern Solution of Deterministic Multiple Source Prioritization and Validation Tasks
A modern rule-based, loosely coupled solution for multiple source prioritization and validation service that follows generalization, efficiency and agility principles in application design and shows the necessary level of attention in process implementation, data architectures and resource usage.
Big Data Visualization Tools
Nikos Bikakis. Encyclopedia of Big Data Technologies, 2019.
Data visualization provides users with intuitive means to interactively explore and analyze data, enabling them to effectively identify interesting patterns and infer correlations and causalities, and supporting sense-making activities.
Linked Data Visualization: Techniques, Tools, and Big Data
This paper presents a meta-modelling architecture that automates the labor-intensive, and therefore slow and expensive, process of manually cataloging individual pieces of data to create a graph of their contents.


DiNoDB: Efficient Large-Scale Raw Data Analytics
This paper combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries of raw files, to produce a new distributed data analytics system the authors call Distributed NoDB.
NoDB: efficient query execution on raw data files
This paper presents the design and roadmap of NoDB, a new paradigm in database systems that does not require data loading while still maintaining the whole feature set of a modern database system, bringing an unprecedented positive effect in usability and performance.
SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading
This article proposes SCANRAW, a novel database meta-operator for in-situ processing over raw files that integrates data loading and external tables seamlessly, while preserving their advantages: optimal performance across a query workload and zero time-to-query.
Parallel in-situ data processing with speculative loading
The results show that SCANRAW with speculative loading achieves optimal performance for a query sequence at any point in the processing, and maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
Discretized streams: fault-tolerant streaming computation at scale
D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes and tolerates stragglers. They can also easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes.
Shark: SQL and rich analytics at scale
Shark is a new data analysis system that marries query processing with complex analytics on large clusters and extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL.
Invisible loading: access-driven data transfer from raw files into database systems
This paper describes a system that achieves the immediate gratification of running MapReduce jobs directly over a file system, while still making progress towards the long-term performance benefits of database systems.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
This paper explores the feasibility of building a hybrid system that takes the best features from both technologies; the prototype built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.
MISO: souping up big data query processing with a multistore system
The method, called MISO for MultIStore Online tuning, is adaptive, lightweight, and works in an online fashion, utilizing only the by-products of query processing, which are termed opportunistic views.
GLADE: big data analytics made easy
We present GLADE, a scalable distributed system for large scale data analytics. GLADE takes analytical functions expressed through the User-Defined Aggregate (UDA) interface and executes them