Pinot: Realtime OLAP for 530 Million Users

@article{Im2018PinotRO,
  title={Pinot: Realtime OLAP for 530 Million Users},
  author={Jean-François Im and Kishore Gopalakrishna and Subbu Subramaniam and Mayank Shrivastava and Adwait Tumbde and Xiaotian Jiang and Jennifer Dai and Seunghyun Lee and Neha Pawar and Jialiang Li and Ravi Aringunram},
  journal={Proceedings of the 2018 International Conference on Management of Data},
  year={2018}
}
Modern users demand analytical features on fresh, real-time data. Offering these analytical features to hundreds of millions of users is a relevant problem encountered by many large-scale web companies. Relational databases and key-value stores can be scaled to provide point lookups for a large number of users but fall apart at the combination of high ingest rates and high query rates at low latency for analytical queries. Online analytical databases typically rely on bulk data loads and are not…

Cool, a COhort OnLine analytical processing system

TLDR
The evaluation results show that Cool outperforms two state-of-the-art systems, MonetDB and Druid, by a wide margin in a single-node setting and can also beat the distributed Druid, as well as SparkSQL, by one order of magnitude in terms of query latency.

From Batch Processing to Real Time Analytics: Running Presto® at Scale

TLDR
How Presto provides unified SQL on heterogeneous storage systems without data copy; how Presto deals with complex data, including nested columnar data and schema evolution; how Presto supports geospatial queries efficiently; and how the file list cache works in Presto.

Meta’s Next-generation Realtime Monitoring and Analytics Platform

TLDR
The next generation of Scuba’s architecture, codenamed Kraken, is presented. It decouples storage management from the query serving system and introduces a single, durable source of truth, enabling tangible improvements to system fault tolerance and query performance while still respecting tolerable bounds of client-observed data freshness.

LogStore: A Cloud-Native and Multi-Tenant Log Database

TLDR
The cloud-native log database LogStore is proposed, which combines shared-nothing and shared-data architectures and utilizes highly scalable and low-cost cloud object storage, while overcoming the bandwidth limitations and high latency of using remote storage when writing a large number of logs.

Data-Parallel Actors: A Programming Model for Scalable Query Serving Systems

TLDR
DPA is used to build a new query serving system, a simplified data warehouse based on the single-node database MonetDB, and enhance existing ones, such as Druid, Solr, and MongoDB, adding missing user-requested features such as load balancing and elasticity.

AggNet: Cost-Aware Aggregation Networks for Geo-distributed Streaming Analytics

TLDR
This work proposes aggregation networks for performing aggregation on a geo-distributed edge-cloud infrastructure consisting of edge servers, transit and destination DCs, and implements an efficient, near-optimal practical heuristic in AggNet, built on top of Apache Flink.

Growth of relational model: Interdependence and complementary to big data

TLDR
This paper aims to provide a complete model of a relational database that is still being widely used because of its well-known ACID properties, namely atomicity, consistency, isolation and durability, and to highlight the adoption of relational model approaches by big data techniques.

CoopStore: Optimizing Precomputed Summaries for Aggregation

TLDR
CoopStore is introduced, a query system that optimizes item frequency and quantile summaries for accuracy when aggregating over multiple segments and leverages additional memory available for summary construction and aggregation to derive a more precise combined result.

Real-time Data Infrastructure at Uber

TLDR
The overall architecture of the real-time data infrastructure is presented, and three scaling challenges are identified that the architecture needs to continuously address for each of its components.

Alibaba Hologres

TLDR
This work proposes Hologres, a cloud-native service for hybrid serving and analytical processing (HSAP) that decouples the computation and storage layers, allowing flexible scaling in each layer, and proposes Execution Context as a resource abstraction between system threads and user tasks.

References

The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database

TLDR
The architecture of the MemSQL Query Optimizer is described, along with the design choices and innovations that enable it to quickly produce highly efficient execution plans for complex distributed queries.

Untangling cluster management with Helix

TLDR
The Helix design and implementation are described, and an experimental study that demonstrates its performance and functionality is presented, detailing several Helix-managed production distributed systems at LinkedIn and how Helix has helped them avoid building custom management components.

Kafka: a Distributed Messaging System for Log Processing

TLDR
This work introduces Kafka, a distributed messaging system that was developed for collecting and delivering high volumes of log data with low latency, and shows that Kafka has superior performance when compared to two popular messaging systems.

Druid: a real-time analytical data store

TLDR
Druid's architecture is described, and how it supports fast aggregations, flexible filters, and low latency data ingestion is detailed.

Optimizing Druid with Roaring bitmaps

TLDR
An extensive series of experiments is carried out to compare the time and space performance of Roaring bitmaps and Concise when used to accelerate Druid's OLAP queries and the other kinds of operations Druid performs on bitmaps, such as retrieving set bits from bitmaps, computing bitmap complements, and aggregating several bitmaps with logical OR and AND operations.

Big Data: Principles and best practices of scalable realtime data systems

TLDR
Big Data describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.

MonetDB/X100: Hyper-Pipelining Query Execution

TLDR
An in-depth investigation into why database systems tend to achieve only low IPC on modern CPUs in compute-intensive application areas yields a new set of guidelines for designing a query processor, and a query processor for the MonetDB system that follows these guidelines is presented.

A Multidimensional OLAP Engine Implementation in Key-Value Database Systems

This paper tries to explore the capabilities of MapReduce-like execution engines for multidimensional data analytics through implementing a Multidimensional Online Analytical Processing (MOLAP)

DB2 with BLU Acceleration: So Much More than Just a Column Store

TLDR
Full integration with DB2 ensures that DB2 with BLU Acceleration benefits from the full functionality and robust utilities of a mature product, while still enjoying order-of-magnitude performance gains from revolutionary technology without even having to change the SQL.

Computing Iceberg Queries Efficiently

TLDR
This work proposes efficient algorithms to evaluate iceberg queries using very little memory and significantly fewer passes over data, as compared to current techniques that use sorting or hashing.