Processing a Trillion Cells per Mouse Click

@article{Hall2012ProcessingAT,
  title={Processing a Trillion Cells per Mouse Click},
  author={Alexander Hall and Olaf Bachmann and Robert B{\"u}ssow and Silviu-Ionut Ganceanu and Marc Nunkesser},
  journal={Proc. VLDB Endow.},
  year={2012},
  volume={5},
  pages={1436--1446}
}
Column-oriented database systems have been a real game changer for the industry in recent years. Highly tuned and performant systems have evolved that provide users with the possibility of answering ad hoc queries over large datasets in an interactive manner. In this paper we present the column-oriented datastore developed as one of the central components of PowerDrill. It combines the advantages of columnar data layout with other known techniques (such as using composite range partitions… 

LEA: A Learned Encoding Advisor for Column Stores

Learned Encoding Advisor (LEA) is introduced, a learned approach to column encoding selection that achieves 19% lower query latency while using 26% less space than a commercial column store on TPC-H.

Cubrick: Indexing Millions of Records per Second for Interactive Analytics

Cubrick's internal data structures, distributed model, query execution engine, and current implementation are described, and results are presented from a thorough experimental evaluation that leveraged datasets and queries collected from a few internal Cubrick deployments at Facebook.

Hillview: A trillion-cell spreadsheet for big data

Hillview is a distributed spreadsheet for browsing very large datasets that cannot be handled by a single machine. As a spreadsheet, Hillview provides a high degree of interactivity that permits data

An Improved Dynamic Vertical Partitioning Technique for Semi-Structured Data

  • Sahel Sharify, A. W. Lu, C. Amza
  • Computer Science
    2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
  • 2019
This paper addresses the challenges of relational support for JSON data through a lightweight, in-memory relational database engine prototype and a flexible vertical partitioning algorithm that uses simple heuristics to adapt the data layout for the workload, on the fly.

Skipping-oriented Partitioning for Columnar Layouts

This paper develops Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account row-based and column-based tradeoffs.

F1 Query: Declarative Querying at Scale

This paper presents the end-to-end design of F1 Query, a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google.

Distributed and interactive cube exploration

DICE is introduced, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels.

Sampling-based Techniques for Interactive Exploration of Large Datasets

This work uses approaches such as speculative execution, data sampling, faceted exploration, and scan sharing to enable exploration of large-scale datasets within interactive response times, with projects such as DICE, Sesame, FluxQuery, and a unified join sampling approach.

Pinot: Realtime OLAP for 530 Million Users

Pinot is presented, a single system used in production at LinkedIn that can serve tens of thousands of analytical queries per second, offers near-realtime data ingestion from streaming data sources, and handles the operational requirements of large web properties.

Small Summaries for Big Data

This comprehensive introduction to data summarization, aimed at practitioners and students, showcases the algorithms, their behavior, and the mathematical underpinnings of their operation that have been incorporated in systems from companies such as Google, Apple, Microsoft, Netflix and Twitter.

References

MonetDB/X100: Hyper-Pipelining Query Execution

An in-depth investigation into why database systems tend to achieve only low IPC on modern CPUs in compute-intensive application areas, a new set of guidelines for designing a query processor, and a query engine for the MonetDB system that follows these guidelines.

C-Store: A Column-oriented DBMS

Preliminary performance data on a subset of TPC-H is presented and it is shown that the system the team is building, C-Store, is substantially faster than popular commercial products.

Integrating compression and execution in column-oriented database systems

This paper shows how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems and evaluates a set of compression schemes and shows that the best scheme depends not only on the properties of the data but also on the nature of the query workload.

Query execution in column-oriented database systems

This dissertation provides (to the best of the author's knowledge) the only detailed study of multiple implementation approaches of such systems, categorizing the different approaches into three broad categories and evaluating the tradeoffs between approaches.

Column oriented Database Systems

This tutorial presents an overview of column-oriented database system technology and addresses questions about how easily a major row-based system can achieve column-store performance and the new applications that can potentially be enabled by column-stores.

Column-stores vs. row-stores: how different are they really?

It is concluded that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.

Brighthouse: an analytic data warehouse for ad-hoc queries

Additional benefits resulting from Knowledge Grid for compressed, column-oriented databases, including assistance in query optimization and execution, are demonstrated by minimizing the need for data reads and data decompression.

Self-organizing tuple reconstruction in column-stores

A novel design, partial sideways cracking, is proposed that achieves performance similar to using presorted data, but without requiring the heavy initial presorting step itself, and brings significant performance benefits for multi-attribute queries.

Challenges in building large-scale information retrieval systems: invited talk

  • J. Dean
  • Computer Science
    WSDM '09
  • 2009
This talk will discuss the evolution of Google's hardware infrastructure and information retrieval systems and some of the design challenges that arise from ever-increasing demands in all of these dimensions.